Statement of Results 

We propose a modification in the definition of reversals, a complexity measure on the
deterministic Turing machine (DTM). To distinguish our definition from the classical
definition [Hartmanis1968] [KamedaVollmar1970], we refer to our definition as identifying
a new resource which is a systolically strict variant of reversals and which we call mode
transitions. With mode transitions as the analogue of parallel time and workspace as the
analogue of hardware, we give an “efficient” simulation of the DTM on the PRAM. In fact,
while the parallel time used is polynomially related to mode transitions (not an
improvement over [Pippenger1979]), we show that simultaneously the hardware used is
linearly related to workspace (whereas [Pippenger1979] gives an arbitrary polynomial
blowup). At the same time, we show that the simulation of PRAMs on the DTM (which uses
mode transitions) is no worse than the previously known simulation (which uses
reversals). One implication of this is that sequential-access oriented algorithms that
use few mode transitions can be easily translated into “efficient” parallel algorithms.
We also explore the implications of the hypothesis that PRAMs can be simulated
“efficiently” on the DTM (using mode transitions). We show that this would tighten the
correspondence between the DTM and the PRAM so much that it would be comparable to the
orthodox interpretation of the Invariance Thesis [VanEmdeBoasHandbook1990], and we
propose a Unified Invariance Thesis to express this fact.


The approach of this study 

For every tape, the “mode” of the tape is defined as the pattern of the O(1) most recent
head movements. (The length of this pattern is the same for all tapes and is called the
period or cycle-time of the DTM.) A head movement is said to be expected if it cyclically
repeats this pattern. If there are no unexpected head movements, then given the iteration
count (of the unidirectional scan) and the modes of all tapes, the locations of all heads
are fixed. Every unexpected head movement is charged separately and is called a mode
transition.

First, the simulation of DTMs on PRAMs is presented. The cases of narrow and shallow
computations (in the sense of circuit complexity) are considered separately. For narrow
computations, the simulation is straightforward. For shallow computations, a nearly
cost-optimal parallel algorithm to compute the transitive closure of bounded-width
layered directed acyclic graphs is used.

Next, PRAM computations are expressed in logic. A family of extensions of first-order
logic is proposed. All extensions are based on a single new operator Y(T,S,V) which takes
as parameters a time domain T, a space domain S and a variance domain V. Each is one of
several domains of different sizes that are part of the structure. For each choice of
these parameters, formulae express languages and the extended calculus expresses a
complexity class. Each formula has a time arity, a space arity and a variance arity.
Certain calculi of this family have a “standard form”; i.e. for every formula of the
calculus, one can construct an equivalent formula which has the syntactic properties of
the standard form. It is shown that for PRAM computations of interest, an equivalent
formula in standard form can be constructed, such that the domain sizes and arities
syntactically represent the complexity of the PRAM computation modulo polynomial factors.

Finally, this formula is evaluated on the DTM. The workspace and mode transitions
required depend on the domain sizes and arities of the formula. The proof adapts
techniques from research on interconnection networks for parallel computing to Turing
machine computations. Note that k-ary predicates over a domain of size n can be viewed as
binary strings of length n^k stored on a DTM tape. A k-tuple of variables can be viewed as
a position of the tape head. A predefined function like successor can be viewed as a map
from head positions to head positions. The key idea is that k-ary predicates can also be
viewed as the contents of a one-bit register in an n^k-processor fixed connection network.
A k-tuple of variables can be viewed as a processor index. A predefined function like
successor can be viewed as a data transformation to be achieved by routing messages over
the network using the properties of the interconnection functions. The way this key idea
is applied here is by interpreting familiar interconnection functions as functions over
binary strings and showing that these are efficiently implementable on the DTM.
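
The tape view of predicates described above can be illustrated with a small sketch. It is
only an illustration under stated assumptions: the lexicographic (base-n) layout of tuples
on the tape and the helper names position, encode_predicate and successor are hypothetical
choices made here, not taken from the thesis.

```python
from itertools import product

def position(tup, n):
    """Map a k-tuple over {0..n-1} to a head position in 0..n**k - 1.

    Assumption: tuples are laid out on the tape in lexicographic (base-n) order.
    """
    pos = 0
    for x in tup:
        pos = pos * n + x
    return pos

def encode_predicate(pred, n, k):
    """Store a k-ary predicate over a domain of size n as a binary string of length n**k."""
    return "".join("1" if pred(*t) else "0" for t in product(range(n), repeat=k))

def successor(pos, n, k):
    """Successor on tuples, viewed as a map on head positions (None past the last tuple)."""
    return pos + 1 if pos + 1 < n ** k else None

# Example: the binary predicate LESS-THAN over a domain of size 3.
tape = encode_predicate(lambda x, y: x < y, n=3, k=2)
```

Under this layout, reading the bit at position((1, 2), 3) answers whether 1 < 2 holds, and
applying successor moves the head from the tuple (0, 2) to the tuple (1, 0).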

The result is proved first for conventional resource bounds; i.e. resource bounds of the
form O(n^k), O((log(n))^k) and O’'((log(n))^k) for some integer k≥1, where n is the size
of the input. This establishes the proof technique. Then the terms constructibility on
simultaneous complexity models and acceptable resource bounds are defined and the results
are extended to acceptable resource bounds.


Application to database query processing 

The paradigms of parallel algorithm design are applied to recursive query processing in
databases. Roughly, an efficient recursive query processing algorithm is a polynomial time
algorithm which performs only O(polylog(|R|)) file operations on a database consisting of
a relation R and stored as a file of records. (This assumes a suitable definition of “file
operation”: an index or sort command on a file of size n takes O(log(n)) file operations,
while a sequence of next-record commands, a sequence of previous-record commands or a
rewind command takes one file operation.) Efficient algorithms are not possible for
arbitrary queries. However, since a directed graph with vertices restricted to outdegree 1
may be thought of as a functional dependency, and very efficient parallel algorithms for
finding least common ancestors in such graphs are known, this is possible at least in some
special cases. For one subclass of queries, an efficient sequential algorithm that uses
few file operations is obtained by analogy with the corresponding parallel algorithm.
Chapter 1 

Introduction 

1.1 History and Motivation 

In complexity theory, we fix a model of computation and specify the complexity measures
to be used. Though equivalence through mutual simulation enabled the treatment of the
notion of (un)solvability as a model-independent mathematical principle (Church’s Thesis),
models which were shown to be computationally equivalent turned out to have differing
properties with respect to specific complexity measures (typically, space and time). In
the last three decades, model-independent complexity notions have come to light which
depend on equivalence through mutual simulation within resource-bounded overhead.
Analogous to Church’s Thesis for computability theory, we have the Invariance Thesis for
complexity theory [VanEmdeBoasHandbook1990]: “Reasonable” machines can simulate each other
within a polynomially bounded overhead in time and simultaneously (as per the orthodox
interpretation) a constant-factor overhead in space. A variant of Turing’s original model,
deterministic Turing machines with a two-way read-only input tape, multiple two-way
read-write worktapes and a one-way write-only output tape (abbreviated to “Turing machine”
hereafter), turned out to be equivalent in the sense of the Invariance Thesis to an
idealised von Neumann computer, the random access stored program machine with logarithmic
cost (abbreviated to “RAM” hereafter) [VanEmdeBoasHandbook1990], and this has encouraged
the rapid adaptation of ideas from computability theory (efficient universal machines,
recursion schemas, program transformations) to computing systems [ValiantHandbook1990].

The situation in parallel computing is quite different. Practical parallel computing
has evolved from research on telephone networks, fast Fourier transforms, sorting,
cellular logic and fault-tolerant multiprocessing ([Stone1971], [Feng1974], [Lawrie1975],
[Davio1981], [WuRosenfeld1981] and the references therein), and has only recently
crystallised into a coherent study of general-purpose parallel architectures
[Siegel’sBook1990] [ValiantHandbook1990] [Blelloch’sBook1990]. A variant of the parallel
random access machine (PRAM) [ValiantHandbook1990] and a parallel vector scan machine
(PVSM) [Blelloch’sBook1990] have been proposed as “appropriate aspiration(s) for the
parallel computer architect much as the von Neumann model is in the sequential case”
[ValiantHandbook1990]. Resource-bounded simulations between parallel computation models
have been studied by many researchers [Siegel’sBook1990] [KarpRamachandranHandbook1990]
[ValiantHandbook1990] [Blelloch’sBook1990]. In one approach, the number of processors
available P is fixed. If a program written to run in t(n) parallel time with p(n)
processors runs in T parallel time when simulated on a P processor machine, the efficiency
of the simulation is said to be a lower bound on the ratio (p(n)t(n))/(PT). The aim is to
achieve constant efficiency for any P < p(n)/log(n) [ValiantHandbook1990]. In another
approach, the following definition of an “efficient” parallel algorithm is used
[KarpRamachandranHandbook1990]: If the fastest sequential algorithm for a problem runs in
time O(T(n)) and a given parallel algorithm A runs in parallel time t(n) with p(n)
processors, then A is efficient if t(n)=O(polylog(n)) and the work w(n)=p(n)t(n) is
O(T(n)polylog(n)). Simulations between some parallel computation models make the notion
of an efficient algorithm invariant with respect to the particular model used. A third
approach shows that certain parallel computation models are approximately equivalent in
that if a given problem can be solved on one of these models in parallel time T(n) using
P(n) processors, then it can be solved on any of the other models in parallel time
poly(T(n)log(P(n))) using poly(P(n)T(n)) processors. Some pairs of models are equivalent
in a tighter sense; their ability to solve problems in O(log(n)^k) parallel time using
polynomially bounded hardware is identical for each fixed positive integer k
[KarpRamachandranHandbook1990]. Some problems are known to be in NC, the class of problems
which can be solved in polylog parallel time using polynomially bounded hardware
[Pippenger1979] [Cook1985], but are not known to have NC algorithms that are within a
polylog factor of the best known lower bound on the work (=processor-time product)
[CoppersmithWinograd1982]. Thus, there are three commonly studied degrees of parallelism
[Pippenger1987]: the case of a single or a fixed number p of processors, called the serial
case; the case of a processor-time product (nearly) equal to the time bound of the best
known sequential algorithm or the best known lower bound on such an algorithm, called the
balanced case; and the case of a number of processors large enough (while still
polynomially bounded or at least reasonable in the sense of the Parallel Computation
Thesis; see below) to allow the computation to be performed in the minimum possible
parallel time, called the highly parallel case.

Much of the work in resource-bounded simulations between parallel computation models can
be classified into these three categories. The fine structure of the class NC is usually
studied in terms of the classes NC^k, k≥1, where NC^k is the class of problems solvable in
O(log(n)^k) time using polynomially bounded hardware. NC^k as defined above is not robust,
and there has been some discussion about the “correct” model on which NC^k (particularly
for k=1) is to be defined ([Borodin1977], [Ruzzo1981], [Cook1985], [CookHoover1985],
[Allender1985], [Allender1989], [ComptonLaFlamme1988]). All this work focuses on obtaining
minimal achievable parallel time (i.e. the highly parallel case mentioned above), while
permitting any polynomial bound on the hardware. As a result, these efforts, though closer
in spirit to classical complexity theory, have remained outside the pale of “efficient”
parallel algorithms, and do not appear to have directly contributed to the ongoing
research on general purpose parallel architectures and efficient universal parallel
machines. (Not one of these papers is cited in [ValiantHandbook1990].) On the other hand,
many of the papers which do present “efficient” simulations between parallel computation
models tend to draw inspiration as much from computer engineering and algorithm design as
from classical complexity theory [Siegel1979], [NassimiSahni1981b],
[LevPippengerValiant1981], [GalilPaul1983], [BorodinHopcroft1985],
[AltHagerupMehlhornPreparata1987], [Blelloch1989]. (All of these are cited in
[ValiantHandbook1990].) These, along with some results mentioned in [Valiant1976],
[FichRagdeWigderson1988] and [HerleyBilardi1988], have set a de facto minimum standard for
theories of practical relevance.

Researchers working on the highly parallel case have made several attempts to compare
parallel computation models with sequential ones, particularly the Turing machine
[PrattStockmeyer1976] [Borodin1977] [Pippenger1979] [Dymond1980] [Goldschlager1982]
[Hong1984a] [Hong1984b] [Parberry1986] [ParberrySchnitger1988] [VanEmdeBoasHandbook1990]
[KarpRamachandranHandbook1990]. Two remarkable results ([PrattStockmeyer1976],
[Pippenger1979]) have led to the articulation of (respectively) two Parallel Computation
Theses, which propose to define “reasonable” parallel computation models as those for
which an analogous result can be obtained. Both results show that parallel computation
models and Turing machines can simulate each other with polynomially bounded overhead.
The simulation of Turing machines on parallel computation models is conceptually similar
in the two cases: by representing (partial) instantaneous descriptions (ids) as nodes and
transitions as edges, the problem of simulation is reduced to that of pathfinding in
graphs. (The encodings and the simulation overheads differ: [PrattStockmeyer1976] deals
with the case when only a space bound is known while [Pippenger1979] deals with the case
when a runtime bound as well as additional information in the form of a reversal bound is
available.) However, the reverse simulations are conceptually different. In
[PrattStockmeyer1976] a recursive depth-first search of the circuit or self-organizing
network gives a simulation that uses workspace polynomially bounded by the circuit depth
(for stacks) but, because it has to revisit the nodes in general, requires worst-case
runtime exponential in the circuit depth even when the circuit size is small. In
[Pippenger1979], the nodes of the circuit are represented by fixed tape cells (like
callers on a telephone network), while the simulation strategy uses sorting techniques to
route packets from source to destination in parallel for all such pairs. This gives a
simulation that is iterative rather than recursive: the runtime required is polynomially
bounded by the circuit size, the result of simulating one parallel timestep is a
continuation that describes the state of the parallel computation after the step, and the
number of reversals is bounded by the product of the number of parallel timesteps and a
polynomial in the logarithm of the circuit size. Thus the first approach
[PrattStockmeyer1976] enables us to show that parallel time is polynomially related to
sequential workspace and the second approach [Pippenger1979] enables us to show that
parallel time is polynomially related to reversals and simultaneously parallel hardware is
polynomially related to sequential runtime.

However, this work also has not been directly applied to the study of efficient universal
parallel machines. (Neither [PrattStockmeyer1976] nor [Pippenger1979] is cited in
[ValiantHandbook1990].) The various perspectives have failed to mesh, leading to a
widening gap between the “machine-independence” culture of classical theory and the
“balanced parallelism” culture of architecture and algorithm design. The hope that
parallelism, interpreted as simultaneous resource bounds, would prove to be a natural
continuation of sequential complexity depended on the anticipation of tight relationships
between resources on sequential models and resources on parallel models. Were such a hope
to come true, one could argue that sequential and parallel complexity are respectively a
one-parameter and a two-parameter analysis of a single phenomenon called computational
complexity, and in principle, both can be studied on the same model of computation through
inclusion, separation and tradeoff results. The lack of efficient simulations between the
DTM and the PRAM in the context of the de facto minimum standard for theories of practical
relevance mentioned above means that either the DTM and the sequential RAM both have to be
dropped from the class of “reasonable” models, or, to take the proposals of
[VanEmdeBoasHandbook1990] and [ValiantHandbook1990] to their logical conclusion, two
separate theories of computational complexity have to be evolved, each with its own
resource-bounded version of Church’s Thesis, and crudely related to each other through the
theorem of [PrattStockmeyer1976], subsequently articulated as the Parallel Computation
Thesis.

There have been cases of models acceptable to computability theory proving unreasonable
for complexity theory (for example, the unary Turing machine), and the DTM, which has
not played an important role in efficient sequential algorithm design, has been less
favoured than the sequential RAM [VanEmdeBoasHandbook1990]. Yet the DTM is the only
sequential model that offers even a crude analogue of parallel time (namely reversals);
there seems to be no corresponding complexity measure for the sequential RAM. Thus while
there have been proposals that the DTM be dropped from the class of “reasonable” models in
view of its unsuitability for efficient sequential algorithm design, other sequential
models are no better than the DTM when it comes to known efficient mutual simulations with
the PRAM. The other alternative (of evolving two independent theories of computational
complexity) seems to suffer from what can only be described as a slight overabundance of
unprovable computation theses.


1.2 Statement of Results 

If it turns out that Pippenger’s 1979 result can be improved, a unified theory of
sequential and parallel complexity might become possible. The key to the result presented
here is a change in the definition of reversals. To distinguish our definition from the
classical definition [Hartmanis1968] [KamedaVollmar1970], we refer to our definition as
identifying a new resource which is a systolically strict variant of reversals and which
we call mode transitions. (The formal definition of mode transitions is in chapter 2.)
With mode transitions as the analogue of parallel time and workspace as the analogue of
hardware, we give an “efficient” mutual simulation of the DTM on the PRAM. In fact, while
the parallel time used is polynomially related to mode transitions (not an improvement
over [Pippenger1979]), we show that simultaneously the hardware used is linearly related
to workspace (whereas [Pippenger1979] gives an arbitrary polynomial blowup). At the same
time, we show that the simulation of PRAMs on the DTM (which uses mode transitions) is no
worse than the previously known simulation (which uses reversals). One implication of this
is that sequential-access oriented algorithms that use few mode transitions can be easily
translated into “efficient” parallel algorithms. We also explore the implications of the
hypothesis that PRAMs can be simulated “efficiently” on the DTM (using mode transitions).
We show that this would tighten the correspondence between the DTM and the PRAM so much
that it would be comparable to the orthodox interpretation of the Invariance Thesis
[VanEmdeBoasHandbook1990], and we propose a Unified Invariance Thesis to express this
fact.


1.3 The Approach of this Study 

A step in a machine computation is called a reversal step if one or more heads either shift 
for the first time or shift in the direction opposite to that in which they last shifted. A head 
may pause and resume motion in the same direction without there being a reversal step. The 
resource reversals is defined as the number of reversal steps in a computation. 
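
The definition of reversal steps can be made concrete with a short sketch. The encoding of
a computation as a list of per-step head shifts (-1 for left, 0 for a pause, +1 for right)
and the name count_reversals are assumptions introduced here purely for illustration.

```python
def count_reversals(steps):
    """Count reversal steps in a computation.

    `steps` has one entry per machine step; each entry is a tuple giving every
    head's shift in that step: -1 (left), 0 (pause), +1 (right).
    """
    if not steps:
        return 0
    last_dir = [0] * len(steps[0])    # 0 means "this head has not shifted yet"
    reversals = 0
    for step in steps:
        is_reversal = False
        for j, shift in enumerate(step):
            if shift == 0:
                continue              # a pause never causes a reversal
            if shift != last_dir[j]:  # first shift, or opposite to the last shift
                is_reversal = True
            last_dir[j] = shift
        if is_reversal:
            reversals += 1
    return reversals
```

For a single head that moves right, pauses, moves right again and then moves left twice,
this counts two reversal steps: the first shift and the turn to the left. The pause in the
middle is free, which is exactly the freedom that the next paragraph objects to.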

The key objection to reversals is this: with reversals, there is no restriction on when
and how often tape heads pause during an otherwise unidirectional scan. (As has been
explicitly specified in [Pippenger1979], the model on which the NC characterisation holds
permits the head to pause. It is not clear to us whether a characterisation of NC is at
all possible on DTM models in which the head is required to move either left or right on
each step.) Because of this, tape heads that start the scan together may not end it
together, even when the total length scanned is the same, and the number of distinct
head-position configurations that in principle could occur during a unidirectional scan
of length s by h tapes is s^h (though at most h * s configurations can occur on any
particular scan). Hence the hardware required for the simulation of [Pippenger1979]
depends on the h-th power of workspace, where h is the number of tapes. In [Pippenger1979]
the author found it “disturbing that the degree of the polynomial ... depends on the
number of tapes possessed by the simulated machine; (the Hennie and Stearns result
[HennieStearns1966] reducing many tapes to two tapes) unfortunately ... produces an
exorbitant increase in reversal.”

Our solution is to replace reversals with a new resource which we call mode transitions.
The new definition eliminates the key objection to reversals mentioned above. For every
tape, the “mode” of the tape is defined as the pattern of the O(1) most recent head
movements. (The length of this pattern is the same for all tapes and is called the period
or cycle-time of the DTM.) A head movement is said to be expected if it cyclically repeats
this pattern. If there are no unexpected head movements, then given the iteration count
(of the unidirectional scan) and the modes of all tapes, the locations of all heads are
fixed. Every unexpected head movement is charged separately and is called a mode
transition. This eliminates the blowup in the simulation of DTMs on PRAMs because the
problem of simulation reduces to computing the transitive closure of layered bounded-width
DAGs, which can be solved by an NC algorithm that is within a polylog factor of linear
cost.
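
One standard way to see why layered bounded-width DAGs are benign is sketched below; it is
not claimed to be the algorithm used in this thesis. Assuming each pair of adjacent layers
is given as a small boolean matrix (a hypothetical input format), reachability from the
first layer to the last is the boolean product of these matrices, and because the product
is associative it can be evaluated by a balanced tree whose depth is logarithmic in the
number of layers. The Python below runs the rounds sequentially; on a PRAM all products
within one round could run in parallel.

```python
def bool_matmul(a, b):
    """Boolean product: result[i][j] is True iff some k has a[i][k] and b[k][j]."""
    return [[any(a[i][k] and b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def layered_reachability(layers):
    """Combine adjacent-layer matrices pairwise until one matrix remains.

    layers[t][i][j] is True iff there is an edge from node i of layer t
    to node j of layer t+1. Returns first-to-last-layer reachability.
    """
    assert layers, "need at least one pair of adjacent layers"
    mats = list(layers)
    while len(mats) > 1:              # each round roughly halves the number of matrices
        nxt = [bool_matmul(mats[i], mats[i + 1]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2 == 1:
            nxt.append(mats[-1])      # the odd one out is carried to the next round
        mats = nxt
    return mats[0]
```

With layer width b and m layers, each round performs at most m/2 products of b x b
matrices, so for constant b the total work stays linear in m. The sketch only returns
reachability between the two outermost layers, not the full transitive closure.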

On the other hand, the definition of mode transitions is flexible enough that the simula- 
tion of PRAMs on the DTM is no worse than the previously known simulation (which uses 
reversals). In particular, the notion of period or cycle-time is designed to allow the heads on 
different tapes to scan at different “average speeds” by following different left-right rhythms. 
This makes possible the “efficient” implementation of permutation and broadcast functions 
used in interconnection networks. A normal form result obtained by a descriptive complex- 
ity technique gives a regular structure to the access pattern of PRAM algorithms (at the 
cost of a polynomial factor blowup in the hardware) and the normalised algorithm is then 
implemented “efficiently” in terms of the interconnection functions mentioned above. 


1.4 The Descriptive Complexity Technique 

In recent years, much attention has been paid to logics that express various complexity
classes. The original approach involved characterisation of (fragments of) conventional
logics, e.g. existential second-order logic expresses NP [Fagin1974]. Then attention
focused on extensions of first-order logic by new atomic predicates (EQUAL, SUCCESSOR,
LESS-THAN, DIVISIBLE_BY_x [folklore], BIT [Fagin1974]), by new kinds of quantifiers and
gates (MAJORITY, PARITY and THRESHOLD gates [folklore], COUNTING quantifiers
[Immerman1987a] [CaiFurerImmerman1989] [ImmermanLander1990]), by operators that iterate
over (syntactically constrained) functionals (e.g. LEAST_FIXED_POINT [AhoUllman1979]
[Immerman1982], [Vardi1982], ITERATIVE_FIXED_POINT [GurevichShelah1989],
RELATIONAL_PRIMITIVE_RECURSION [ComptonLaFlamme1988]) and by operators that could report
the solution to some standard search problem (transitive closure [Immerman1986] or
Hamiltonicity [Stewart1989]). Another approach studied expressibility on restricted
structures (trees [Lindell1987]). In each case, the effort was to characterise the class
of problems that could be described by the logic and to relate the complexity of the
problems in this class to the restrictions and the complexity of the operators. This area
is referred to as descriptive complexity.

An independent research program on the development of new database query languages
[Zloof1977], [AhoUllman1979], [ChandraHarel1980], [ChandraHarel1982], [Chandra1988]
[AbiteboulVianu1990] [AbiteboulVianu1991a] had initially served as motivation, but later
descriptive complexity research broadened to include the development of logical senses
for complexity notions (inductive depth corresponds to parallel time, number of variables
corresponds to number of parallel processors [Immerman1987b]; new atomic predicates in
the formalism correspond to relaxing the uniformity condition on circuit families, new
types of quantifiers correspond to new types of gates in the circuit
[BarringtonImmermanStraubing1988]).

We use the techniques developed by these lines of research to structure the proof of our
result simulating PRAMs on DTMs. While it appears to us very difficult to directly
simulate the PRAM on the DTM within tight resource constraints, a logic-based “programming
language” used as a via media turns out to be both powerful enough to express PRAM
computations (though at the cost of a polynomial factor blowup in the hardware) as well as
(syntactically) simple enough to permit tightly coded simulations of normalised formulae
of this language on the DTM.

The “programming language” we use is first-order logic extended by what we call the
Y-operator. Such an extension is usual in descriptive complexity, but here there is one
crucial difference in orientation: instead of choosing an extending operator that is of a
priori interest, the Y-operator used here has been designed to make the proof simple. The
Y-operator constructs a predicate whose arity corresponds to the processor complexity of
the PRAM computation as well as the space complexity of the DTM simulation.

A “standard form” exists for the extended logic FO+Y, consisting of a propositional
formula in the scope of a single Y-operator (with some additional nice properties that are
of technical importance). Thus we can use the Y-operator indiscriminately while writing a
formula expressing the PRAM computation, and then construct the equivalent formula in
standard form before simulating it on the DTM.

By constructing a flexible extension of first-order logic, we provide ourselves with a “high- 
level” programming language in which PRAM computations can be conveniently expressed. 
By constructing a “standard form”, we translate programs in this “high-level” language to a 
syntactically simplified sub-language that is still equivalent in power. The resulting programs 
are now simple enough in structure that we can simulate them on the relatively inflexible 
multitape DTM model in a straightforward manner. 


1.5 Organization of the Thesis 

In Chapter 2, the sequential and parallel models of computation and the resources of
interest are defined. Then the resource bounded simulation of the DTM model on the PRAM
model is presented. In Chapter 3, the calculus FO+Y(T,S,V) extending first-order logic is
defined and a standard form is demonstrated. Then the simulation of the PRAM model in the
calculus is presented. In Chapter 4, the simulation of the calculus on the DTM model is
presented. The main result is presented in Chapter 5. In Chapter 6, the results are
summarised, some spin-off possibilities are pointed out and suggestions for future work
are given.



Chapter 2 

Mode Transitions 


In this chapter, first the sequential (DTM) and parallel (PRAM) models of computation and 
the resources of interest are defined. The definition of mode transitions and its motivation 
appears here. Then the resource bounded simulation of the DTM model on the PRAM model 
is presented. The simulation of the PRAM model on the DTM model is covered in chapters 
3 and 4. 


2.1 Basic Definitions and Claims 

Definition 2.1.1: 

The model of sequential computation is the deterministic multitape Turing
machine having a read-only input tape of finite length (i.e. there are endmarker
brackets immediately enclosing the input string beyond which the machine cannot
go) and one or more read-write work tapes. W.l.o.g. only decision problems
are considered, hence an output tape is unnecessary. Two resources, workspace
s(n) and mode transitions x(n), are considered. Workspace is the well-known
resource defined as the total number of tape cells accessed by the read/write heads
of all worktapes during a computation.

Mode transitions are defined below. 


Definition 2.1.2: 

On the multitape DTM model, we define the resource mode transitions as 
follows: 

For each tape, including the input tape, define two terms, viz. the mode of
the head and the mode of the tape.

At each instant, the head of each tape is said to be in one of three modes:
“LEFT”, “STOPPED” and “RIGHT”. Before and up to time t=0, all heads are
said to be in mode “STOPPED”. After each step, each head is said to be in that
mode which best describes its motion in that step. (Thus, after the halting state
is reached, all heads are said to be in mode “STOPPED”.)

Let the alphabet of the machine be Σ. Define the cycle-time or period of the
machine to be w = ⌈log_2(|Σ|)⌉.

At each instant, the mode of each tape is said to be the w-tuple of the modes 
of the head of that tape at the last w consecutive instants, including the current 
instant. (Thus, after each step, the new mode of the tape is obtained by a non- 
cyclic left-shift of this w-tuple, with the new mode of the head as the rightmost 
element of the new w-tuple. The leftmost (oldest) element of the old w-tuple is 
lost.) 

Two modes of a tape are said to be similar if they can be converted
into each other by a single cyclic (lossless) shift of the w-tuple. Thus
if m_1, m_2, ..., m_{w-1}, m_w are consecutive modes of the tape head, the
corresponding tape mode <m_1, m_2, ..., m_{w-1}, m_w> (the “current” mode) is
similar to <m_2, ..., m_{w-1}, m_w, m_1> (the expected “next” mode) and to
<m_w, m_1, m_2, ..., m_{w-1}> (the presumed “previous” mode). When w=1, <m_1>
is similar only to itself. Note that similarity is not an equivalence though the
transitive closure of similarity is an equivalence. Two modes which are not
similar are said to be different.

A mode transition by a tape at a step is said to occur if the tape is in different 
modes at the instants just before and just after that step. The mode transitions 
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used in a computation is the total number of steps during that computation at which a mode 
transition by some tape occurs. 
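The definition above can be illustrated with a minimal Python sketch for a single tape. The function names (`shift_in`, `similar`, `count_mode_transitions`) and the encoding of head modes as strings are assumptions made for illustration; the full resource counts the steps at which some tape makes a transition, whereas this sketch tracks one tape only.

```python
from typing import List, Tuple

Mode = Tuple[str, ...]  # a w-tuple of "LEFT" / "STOPPED" / "RIGHT"


def shift_in(mode: Mode, head_mode: str) -> Mode:
    """Non-cyclic left shift: drop the oldest head mode, append the newest."""
    return mode[1:] + (head_mode,)


def similar(a: Mode, b: Mode) -> bool:
    """True if a single cyclic shift (in either direction) turns a into b."""
    left = a[1:] + a[:1]
    right = a[-1:] + a[:-1]
    return b == left or b == right


def count_mode_transitions(moves: List[str], w: int) -> int:
    """Count the steps at which this one tape changes to a different mode."""
    mode: Mode = ("STOPPED",) * w  # all heads are STOPPED up to time t=0
    transitions = 0
    for head_mode in moves:
        new_mode = shift_in(mode, head_mode)
        if not similar(mode, new_mode):
            transitions += 1
        mode = new_mode
    return transitions
```

With period 2, a head that alternates RIGHT and LEFT settles into a repeating pattern after two transitions, which is the sense in which mode transitions differ from classical reversals.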

Claim 2.1.3: 

The definition of the resource reversals is obtained from the definition of mode transitions
by requiring that (unlike in the case of mode transitions):

(1) the mode of the head does not change if the head does not move, and (2) the cycle-time
(period) w=1 regardless of the alphabet size,

that is, the mode of the head (tape) is “STOPPED” (“<STOPPED>”) until the head
moves for the first time and is either “LEFT” (“<LEFT>”) or “RIGHT” (“<RIGHT>”)
ever after. Counting the resource as in the case of mode transitions now gives the number of
reversals.

Definition 2.1.4: 

A sweep is a contiguous subsequence of the sequence of ids in a computation
which begins with either the initial id or the id just after a mode transition on
some tape and ends with either the halting id or the id just before a mode
transition on some tape (not necessarily the same tape) but has no mode transitions
on any tape in between.

Claim 2.1.5: 

The computation sequence of a DTM can be partitioned (uniquely) into a series of sweeps
and the number of sweeps is exactly one more than the number of mode transitions used by 
the computation. 

Definition 2.1.6: 

For the duration of a sweep, for each tape head, define the cell number of the
cell under the head at the beginning of the sweep to be 0, incrementing to the
right for the other cells. Let the mode of tape j be denoted mode_j and let
the location of the head of tape j after the i-th step relative to the location at the
beginning of the sweep, which is given by the cell number of the cell under the
head, be denoted LOC(i,mode_j).

Claim 2.1.7:

Knowing the mode of the tape during the sweep, the cell number of the cell which would
be under the tape head after the i-th step of the sweep (provided the sweep lasts that long) can
be determined, i.e. LOC() can be computed in time polylog(i+n) by a single processor, where
n is the input size.

Proof: 

Let mode=<m_1,...,m_w>. Let the residual right movement Del(c), 1≤c≤w, be defined as
(Arr(c)-Ell(c)), where Ell(c) is the number of occurrences of “LEFT” in the first c elements
of mode and Arr(c) is the number of occurrences of “RIGHT” in the first c elements of mode.
Then LOC(i,mode)=Del(w)*⌊(i/w)⌋+Del(i mod w), taking Del(0)=0.

∎
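The closed form in this proof can be written out as a short Python sketch. The names `prefix_displacements` and `loc` are hypothetical, and the sketch assumes that the first step of the sweep follows the first element of the mode tuple and that Del(0)=0.

```python
def prefix_displacements(mode):
    """Return P with P[c] = Del(c) = (#RIGHT - #LEFT) in the first c entries; P[0] = 0."""
    step = {"RIGHT": 1, "LEFT": -1, "STOPPED": 0}
    p = [0]
    for m in mode:
        p.append(p[-1] + step[m])
    return p


def loc(i, mode):
    """Cell number under the head after step i of a sweep that repeats `mode` cyclically."""
    w = len(mode)
    p = prefix_displacements(mode)
    # Del(w) * floor(i / w) + Del(i mod w)
    return p[w] * (i // w) + p[i % w]
```

Only a constant amount of arithmetic on numbers of size O(i) is involved, which is consistent with the polylog(i+n) bound stated in the claim.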

Claim 2.1.8: 

Knowing the mode of the tape during the sweep and the duration i (number of steps) of 
the sweep, for each cell number loc, it can be determined in time polylog(i+n) by a single 
processor, where n is the input size, whether that cell was accessed during the sweep. 

Proof: 

Let mode=<m_1,...,m_w>. Let the maximal movement Delmax(c) and the minimal movement
Delmin(c), 1≤c≤w, be defined as the maximum and minimum over all 1≤k≤c of
Del(k), where Del(k) is as defined in the proof of the previous claim. Let the duration
of the sweep be d. Let the overshoot be defined as Over=(Delmax(w)-Del(w)) if d≥w
and Over=0 otherwise. Let the undershoot be defined as Under=(Delmin(w)-Del(w)) if
d≥w and Under=0 otherwise. If Del(w)>0, the minimum cell number of any cell that
is accessed during the sweep is Delmin(min(d,w)) and the maximum cell number of any
cell that is accessed during the sweep is Del(w)*⌊(d/w)⌋+max(Over,Delmax(d mod w)). If
Del(w)<0, the minimum cell number of any cell that is accessed during the sweep is
Del(w)*⌊(d/w)⌋+min(Under,Delmin(d mod w)) and the maximum cell number of any cell that
is accessed during the sweep is Delmax(min(d,w)). A cell is accessed during the sweep only
if its cell number lies in this range.

∎
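The range computation can be sketched in Python as follows. This is not a transcription of the formulas above: it is a variant that also counts the starting cell 0 as accessed and handles either sign of Del(w) uniformly by looking at the first full period, the last full period and the trailing partial period. The names `accessed_range` and `was_accessed` are hypothetical.

```python
def accessed_range(d, mode):
    """Smallest and largest cell numbers under the head during a sweep of d steps."""
    step = {"RIGHT": 1, "LEFT": -1, "STOPPED": 0}
    w = len(mode)
    p = [0]
    for m in mode:
        p.append(p[-1] + step[m])
    drift = p[w]                      # Del(w): net movement per full period
    q, r = divmod(d, w)               # q full periods, then r extra steps
    # Trailing partial period (offsets 0..r on top of q full periods).
    lows = [min(p[: r + 1]) + q * drift]
    highs = [max(p[: r + 1]) + q * drift]
    if q >= 1:
        full_lo, full_hi = min(p[:w]), max(p[:w])
        # Extremes over full periods are linear in the period index,
        # so only the first and the last full period need checking.
        for k in (0, q - 1):
            lows.append(full_lo + k * drift)
            highs.append(full_hi + k * drift)
    return min(lows), max(highs)


def was_accessed(cell, d, mode):
    """The head moves at most one cell per step, so the accessed cells form an interval."""
    lo, hi = accessed_range(d, mode)
    return lo <= cell <= hi
```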

Claim 2.1.9: 

Knowing the mode of the tape during the sweep and the duration k (number of steps) of
the sweep, for each cell numbered (say) j which is accessed during the sweep, the largest i such
that LOC(i,mode)=j can be calculated, i.e. the inverse function LOC^{-1}() can be computed
in time polylog(k+n) by a single processor, where n is the input size.

Proof: 

Let the duration of the sweep be d. We want the largest i not greater than d
such that LOC(i,mode)=j. By assumption the cell with cell number j is indeed
accessed during the sweep, so such an i exists. W.l.o.g. assume Del(w)>0 (symmetry)
and d>w (finite modification) and hence also j>0 (finite modification). We know
from previous claims that LOC(i,mode)=Del(w)*⌊(i/w)⌋+Del(i mod w) and that the cell
number of any cell that is accessed during the first i steps of the sweep is bounded
by Del(w)*⌊(i/w)⌋+Delmax(w). This means that the cell at cell number j is not
accessed during the first w*⌊((j-Delmax(w))/Del(w))⌋ steps of the sweep or after the first
[…]/Del(w))⌉) steps of the sweep. This is a finite range which can be exhaustively
enumerated to find the largest i not greater than d such that LOC(i,mode)=j.

∎
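A bounded search of the kind used in this proof can be sketched in Python. The window bounds below are derived directly from LOC(i,mode)=Del(w)*⌊i/w⌋+Del(i mod w) rather than copied from the text, the case Del(w)<0 is reduced to Del(w)>0 by mirroring as in the proof, and the case Del(w)=0 is handled separately; the name `last_access_time` is hypothetical.

```python
def last_access_time(j, d, mode):
    """Largest i in [0, d] with LOC(i, mode) == j, or None if cell j is never under the head."""
    step = {"RIGHT": 1, "LEFT": -1, "STOPPED": 0}
    w = len(mode)
    p = [0]
    for m in mode:
        p.append(p[-1] + step[m])
    drift = p[w]
    if drift < 0:
        # Symmetry: swap LEFT and RIGHT and look for cell -j instead.
        swap = {"RIGHT": "LEFT", "LEFT": "RIGHT"}
        return last_access_time(-j, d, tuple(swap.get(m, m) for m in mode))
    if drift == 0:
        # Positions repeat every w steps, so the last two periods suffice.
        q_lo, q_hi = max(0, d // w - 1), d // w
    else:
        pmin, pmax = min(p[:w]), max(p[:w])
        q_lo = max(0, -((pmax - j) // drift))   # ceil((j - pmax) / drift)
        q_hi = min(d // w, (j - pmin) // drift)
    # Enumerate the O(w) candidate periods from the latest time downwards.
    for q in range(q_hi, q_lo - 1, -1):
        for r in range(w - 1, -1, -1):
            i = q * w + r
            if i <= d and drift * q + p[r] == j:
                return i
    return None
```

Since w is a constant of the machine, the number of candidates examined is also a constant, independent of d.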

Claim 2.1.10: 

A DTM that uses at most s(n) workspace and x(n) mode transitions uses at most
O((s(n)+n)*x(n)) runtime.

Claim 2.1.11: 

If x(n) is an upper bound on the number of mode transitions used on inputs of length n by
a DTM that always halts and this bound is actually achieved for some input for each n, the
runtime bound of the DTM is Ω(x(n)). If s(n) is an upper bound on the workspace used by
that DTM on inputs of length n, O(c^(s(n)+log(n))) for some constant c>1 is another bound on
the runtime of the DTM. Hence it follows that log(x(n))=O(s(n)+log(n)), for x(n) and s(n)
as defined above.

Claim 2.1.12:

Let x(n) be an upper bound on the number of mode transitions used on inputs of length n by
a DTM that always halts. If the amount of workspace used by the DTM on inputs of length n
is s(n), s(n)=O(n*c^(x(n))) for some constant c>1, or, equivalently, log(s(n))=O(x(n)+log(n)).

Proof: 

Since the DTM always halts, it follows that it works in bounded memory. Again, since 
it works in bounded memory and halts, no configuration of the DTM other than the halting 
configuration is ever repeated during a computation. Initially all the worktapes are blank 
and the input tape is finite and of size n. The only way the DTM can increase its workspace 
is by moving its worktape head(s) into the blank region beyond what was previously accessed 
during the computation. 

Consider the amount by which the DTM can increase its workspace between two mode 
transitions, i.e. by one sweep. A mode is a finite pattern of size w, hence the DTM cannot 
move more than (w/2) steps in the direction opposite to the sweep direction (which is to the 
right if the residual right movement is positive and to the left otherwise). 

The tape cells that have been left behind (are no longer accessible without a mode tran- 
sition) have no further effect on the computation and the termination of the current sweep 
is not contingent on either the number or the contents of these tape cells. 

If the amount of workspace used by the DTM up to the start of the i-th sweep is f(i), and
the i-th sweep terminates, the maximum number of steps in the i-th sweep is O(n+f(i)). During
this time, the maximum number of tape cells visited on those worktapes where the tape head
is moving into the blank region is O(n+f(i)), hence we have the recurrence f(i+1)=O(n+f(i))
and f(0)=0, where the constant implied by the big-Oh is independent of i and n. Solving
this, we get f(i)=O(n*c^i) for some constant c. With i=x(n), we have f(i)=s(n)=O(n*c^(x(n))).

∎
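For reference, the recurrence in this proof can be unrolled explicitly. The following is a worked sketch, assuming the constant hidden in the big-Oh is written as c and that c>1 (which may be assumed without loss of generality by enlarging the constant):

```latex
\begin{align*}
f(0) &= 0, \qquad f(i+1) \le c\,\bigl(n + f(i)\bigr),\\
f(i) &\le n\,\bigl(c + c^{2} + \cdots + c^{i}\bigr)
      = n\,\frac{c\,(c^{i}-1)}{c-1}
      = O\bigl(n\,c^{i}\bigr),\\
s(n) &= f\bigl(x(n)\bigr) = O\bigl(n\,c^{x(n)}\bigr)
      \quad\Longrightarrow\quad
      \log s(n) = O\bigl(x(n) + \log n\bigr).
\end{align*}
```

The second line follows by induction on i: substituting the bound for f(i) into the recurrence adds one more term to the geometric sum.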

Definition 2.1.13: 

The model of parallel computation is essentially the SIMD PRAM with
CRCW(PRIORITY) [Goldschlager1982]. (Some details, including the placement
of the input string, are different. See below.) This is a synchronous parallel
machine such that any number of processors may read or write into any word of
global memory at any step. If several processors try to write into the same word
at the same time, then the lowest numbered processor succeeds. Each processor
has a finite set of registers including the following:

PROCESSOR: contains the processor number of the processor. This is a
read-only register.

ADDRESS: contains an address of global memory,

CONTENTS: contains a word read from global memory or to be written into
it, and

PROGRAM_COUNTER: contains the line number of the instruction to be
executed next. The initial content is 1 for all processors and is incremented after
each instruction. This register cannot be addressed by any instruction other than
BLT and HALT (see below).

The instructions to be simulated are limited to the following: 

READ: Read the word of global memory specified by ADDRESS into CONTENTS,

WRITE: Write the word in CONTENTS to the global memory location specified
by ADDRESS,

OP Ra Rb: Perform operation OP on Ra and Rb leaving the result in Rb (here
OP may be ADD or SUBTRACT; Ra and Rb are register names),

MOVE Ra Rb: Move contents of Ra to Rb,

INC/DEC R: Increment/Decrement the contents of register R, and

BLT R L: Branch to line numbered L if the contents of R is less than zero. L
must be greater than 0.

HALT: Halt all processors whose index is greater than or equal to that of the
processor executing the instruction. The content of the program counter register
of all halted processors is reset to 0 and cannot change thereafter.

There is a common read-only program memory from which the instructions
are read. The computation is said to have halted when all processors have halted.
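The PRIORITY write rule of this model (the lowest numbered processor wins a write conflict) can be illustrated with a small Python sketch. The function name `resolve_priority_writes` and the representation of a write request as a `(processor, address, value)` triple are assumptions for illustration only; they are not part of the machine model.

```python
def resolve_priority_writes(requests):
    """Resolve one synchronous step of concurrent writes under the PRIORITY rule.

    requests: iterable of (processor_number, address, value) triples.
    Returns a dict mapping each written address to the value that survives.
    """
    winners = {}
    for proc, addr, value in requests:
        # Keep the request of the lowest numbered processor for each address.
        if addr not in winners or proc < winners[addr][0]:
            winners[addr] = (proc, value)
    return {addr: value for addr, (proc, value) in winners.items()}
```

For example, if processors 3 and 1 both write to address 10 in the same step, the word at address 10 ends up holding the value written by processor 1.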

The input is a binary string of length n. All registers and global memory
locations are width(n)=⌈(log(n)+2)⌉ bits wide. The first global memory location
contains the number n. (Integers are stored in sign-magnitude format.) The input
string is placed width(n) bits at a time in the next ⌈(n/width(n))⌉ global memory
locations. The least significant bit of the second global memory location is the
leftmost bit of the input string, and any unused bits in the (⌈(n/width(n))⌉+1)-th
global memory location are 0. All these locations are read-only and attempts to
write to them are equivalent to no-ops or skips. All subsequent global memory
locations are read/write and their contents are originally 0.
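The input layout described above can be sketched in Python. The sketch assumes that the logarithm in width(n) is taken to base 2 and that n≥1, and it ignores the sign bit of the sign-magnitude format (each word is treated as a plain bit field); the names `width` and `layout_input` are hypothetical.

```python
def width(n):
    """Word width ceil(log2(n) + 2) for n >= 1, computed with integer arithmetic."""
    return (n - 1).bit_length() + 2


def layout_input(bits):
    """Return the read-only prefix of global memory for an input given as a list of 0/1."""
    n = len(bits)
    wd = width(n)
    words = [n]                       # first location holds the input length
    for start in range(0, n, wd):
        word = 0
        for k, b in enumerate(bits[start:start + wd]):
            # The leftmost bit of each chunk goes into the least significant position;
            # bits beyond the end of the input stay 0.
            word |= (b & 1) << k
        words.append(word)
    return words
```

The returned list has 1+⌈n/width(n)⌉ entries, matching the count of read-only locations in the text.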

The resources of interest are hardware and parallel time, denoted respectively
as h(n) and t(n), n being the length of the input. Parallel Time is simply the
number of synchronous steps in a parallel computation. A parallel computation
is said to use p(.) processors if the largest index of any processor activated
on inputs of length n is p(n). It is said to use m(.) global memory if the
largest address of any global memory word accessed on inputs of length n is
⌈(n/width(n))⌉+1+m(n). (That is, the read-only locations used for supplying the
input are free.) The amount of Hardware used by the computation is defined to
be h(n)=(p(n)+m(n))*width(n).

When the amount of global memory is not of interest, or when m(n)=O(p(n)),
the hardware is linearly bounded by the processor-wordwidth product
p(n)*width(n).

Claim 2.1.14:

The CRAM model of [Immerman1987b] differs from the CRCW(PRIORITY) PRAM
model of [Goldschlager1982] only by the addition of special instructions that are of help in
sublogarithmic time computations. For the present study, the parallel time is O(log(n)), so
the two models are equivalent to our model and the results here hold for the CRAM model as
well as the CRCW(PRIORITY) PRAM model.

Claim 2.1.15: 

The instantaneous id of a DTM that uses at most s(n) workspace and x(n) mode transitions
on any input of length n (the state of the automaton, the contents of all non-input tapes and
the head locations and modes of all tapes including the input tape; the input tape contents are
excluded since they are part of the read-only input to the PRAM) can be represented using
O(s(n)+log(n)) hardware on a PRAM. The amount of hardware is independent of the number
of mode transitions x(n).

Proof: 

A tape of the DTM is represented by an array in global memory. Each cell of the tape
contains w=O(1)-bit data (w is the period or cycle-time of the DTM). Since there are width(n)
bits in one global memory word, ⌊(width(n)/w)⌋ cells are packed in one location of the array.
The location of the head on the tape can be specified in ⌈(log(s(n)))⌉ bits and is represented
in ⌈(log(s(n))/width(n))⌉ words of global memory. This is O(1) for s(n)=O(poly(n)). The
mode of the tape requires ⌈(w log(3))⌉ bits and is represented in ⌈(w log(3)/width(n))⌉=O(1)
global memory words. Each tape except the input tape of the DTM is represented in this
manner.

For the input tape only the head location and the mode are represented. The state of the
machine requires O(1) bits and is maintained in O(1) global memory words. The total amount
of hardware used for representing the instantaneous id of the DTM (including the head
location and mode of the input tape but excluding the input tape contents) is O(s(n)+log(n)),
corresponding to O((s(n)/width(n))+1) global memory locations.

∎


2.2 Simulation of the DTM model on the PRAM model 

We are interested in simulating simultaneous workspace and mode transition bounded
DTMs on simultaneous hardware and parallel time bounded PRAMs. For the purposes
of this section, conventional workspace bounds s(n) and conventional mode transition
bounds x(n) are those of the form O(log(n)^k), O(n^k) and O*(n^k) (for some integer
k≥1), with the restriction that either x(n)=Ω(n) and s(n)=O(poly(x(n))) or s(n)=Ω(n) and
log(s(n))=O(poly(x(n)+log(n))). (O*(f(n)) is a shorthand for O(f(n)*polylog(f(n))).) This
rules out DTMs working in simultaneous o(n) workspace and o(n) mode transitions from
consideration here even if they happen to use Ω(n) runtime. (Extending the following results
to this case is open.)

Theorem 2.2.1: 

For conventional workspace bound s() and conventional mode transition bound x(),
such that x(n)=Ω(n) and s(n)=O(poly(x(n))), let M be a DTM that uses at most s(n)
workspace and x(n) mode transitions on any input of length n and accepts language L.
Then there is a PRAM algorithm that uses at most h(n)=O(s(n)+log(n)) hardware and
t(n)=O(poly(x(n)+log(n))) parallel time and accepts L.

Proof: 

The input to the simulation consists of the description of the DTM and the input string
x in encoded form. Any suitable encoding will do as long as it takes (w|x|)+O(1) bits in all,
where w is the period or cycle-time of the DTM. The encoding should be easy to parse since it
will be accessed in the innermost loop of the simulation for obtaining the input tape contents
and for computing the DELTA function. The instantaneous id of the DTM (including the
head location and mode of the input tape but excluding the input tape contents) can be
represented using O(s(n)+log(n)) hardware.

The transition function of the DTM is encoded as a subroutine DELTA which assumes 
that the “current” state and the symbols under the tape heads are available in encoded form 
in some registers of the processor and returns the “next” state, the symbols to be written and 
the head movements on the tapes in encoded form in some other registers of the processor. 
The subroutine DELTA can be executed by any (or all) of the processors in parallel, possibly
with different arguments. The execution of the subroutine itself has no direct effect on global
memory. The subroutine may be executed redundantly as required for different phases of the
simulation.

A naive sequential simulation can be carried out on a PRAM using only one processor
(i.e. without using any parallel features) in time t(n)=O(poly((s(n)+n)+x(n)+n))
with additional hardware overhead O(log(x(n))+log(s(n))+log(n)). W.l.o.g. we assume
log(x(n))=O(s(n)+log(n)) (see basic claims above). Since s(n)=O(poly(x(n))), we have
t(n)=O(poly(n*x(n)+n)) and h(n)=O(s(n)+log(n)) including the additional hardware
overhead. Also x(n)=Ω(n), so we have t(n)=O(poly(x(n))), hence t(n)=O(poly(x(n)+log(n))) as
required. Details are omitted.

∎

Theorem 2.2.2: 

For conventional workspace bound s() and conventional mode transition bound x(), such
that s(n)=Ω(n) and log(s(n))=O(poly(x(n)+log(n))), let M be a DTM that uses at most
s(n) workspace and x(n) mode transitions on any input of length n and accepts language
L. Then there is a PRAM algorithm that uses at most h(n)=O(s(n)+log(n)) hardware and
t(n)=O(poly(x(n)+log(n))) parallel time and accepts L.

Proof: 

The input to the simulation consists of the description of the DTM and the input string
x in encoded form. Any suitable encoding will do as long as it takes (w|x|)+O(1) bits in all,
where w is the period or cycle-time of the DTM. The encoding should be easy to parse since it
will be accessed in the innermost loop of the simulation for obtaining the input tape contents
and for computing the DELTA function. The instantaneous id of the DTM (including the
head location and mode of the input tape but excluding the input tape contents) can be
represented using O(s(n)+log(n)) hardware.

The transition function of the DTM is encoded as a subroutine DELTA which assumes 
that the “current” state and the symbols under the tape heads are available in encoded form 
in some registers of the processor and returns the “next” state, the symbols to be written and 
the head movements on the tapes in encoded form in some other registers of the processor. 
The subroutine DELTA can be executed by any (or all) of the processors in parallel, possibly
with different arguments. The execution of the subroutine itself has no direct effect on global
memory. The subroutine may be executed redundantly as required for different phases of the
simulation.

Only t(n)=poly(x(n)+log(n)) parallel iterations are available on the PRAM. Though the
DTM time bound in the worst case is O((s(n)+n)*x(n)), the number of sweeps is at most
x(n)+1 and the goal is to implement each sweep using a small amount of parallel time. The
simulation consists of an initialization phase, a construction phase of x(n)+1 iterations (one
iteration per sweep) and a reporting phase. The initialization phase, the reporting phase and
each iteration of the construction phase are implemented in poly(x(n)+log(n)) parallel time
with O(s(n)+log(n)) hardware, thus satisfying the requirements.

In the initialization phase, all the arrays (except the input) are written with the “blank”
symbol, the heads are positioned in the center of the arrays and the modes are set to
STOPPED^w. (Since it is not known in advance whether the DTM will use the tape to
the left of the initial head position or the tape to the right of it, s(n) space is allocated on
each side of the initial position. Hence the head is in the center of the array.)

Each iteration of the construction phase consists of a setup stage, a path-finding stage and 
a commit stage, which together simulate one sweep of the DTM computation. In the setup 
stage a directed layered (acyclic) graph is constructed which encodes all possible computation 
sequences during the sweep. In the path-finding stage, using the arrays in global memory, the 
tape modes, the tape head locations, the DTM-state and the transition function DELTA, the 
actual computation sequence is constructed as a path in this graph. The use of an efficient 
parallel algorithm for finding this (long) path quickly leads to the speedup of the sweep. In 
the commit stage, all the arrays in global memory, the tape modes, the tape head locations 
and the DTM-state are updated to represent the computation id at the end of the sweep. 

Finally, in the reporting phase, the output of the DTM computation (accept/reject) is
taken as the output of the PRAM simulation. Essentially, each sweep is treated as a
reachability problem on a restricted class of graphs (layered DAGs). In pseudo-code, this algorithm
may be summarised as follows:

function simulate(tape0:global-array of tape-symbols):bool
/* tape0 is the input tape */

alloc tape1,tape2,...,tapeh:global-array of tape-symbols,
mode0,mode1,mode2,...,modeh:tape-mode,
head0,head1,head2,...,headh:integer, /* location of the head */
state:DTM-state;
begin

mode0←STOPPED^w;head0←0;

for(i=1;i<=h;i++){tapei←BLANK;modei←STOPPED^w;headi←0;}
state←START;

while((state<>ACCEPT)and(state<>REJECT)){
construct-graph; find-path; update-status;

}

if(state==ACCEPT)return(TRUE); else return(FALSE);
end.

In the pseudo-code above, the call to procedure construct-graph represents the setup stage, 
the call to procedure find-path represents the path-finding stage and the call to procedure 
update-status represents the commit stage. It is now shown how these are implemented. 

The vertex set of the graph is constructed as follows: The maximum number of steps in a
sweep is equal to the workspace bound w+s(n), which also determines the size of the arrays
(n is the input length). Let Q be the set of states of the machine. Let Σ be the tape
alphabet. Let Σ^w be the set of w-tuples of elements of Σ (w is the cycle-time or period of
the machine). The most recent w tape symbols read by a tape head (including the symbol
currently under the head) can be represented by an element of Σ^w. Let h be the number of
tapes. For the j-th tape head, let a_j∈Σ^w denote the most recent w symbols read by the
tape head, as above. For each i: 0≤i≤s(n), each b∈Q and each j: 1≤j≤h, introduce a vertex
with label <i,b,a_1,...,a_j,...,a_h> in the graph.

The interpretation is as follows: Each vertex with label (say) <i,b,a_1,...,a_j,...,a_h>
represents a hypothesis that just after the i-th step of the sweep, the machine is in state b, relative
to the beginning of the sweep the location of the j-th tape head is LOC(i,mode_j) and the
most recent w symbols read/scanned by the j-th tape head is a_j, the symbol currently being
scanned being last(a_j). (Note that the symbols written during the last w steps are not
explicitly represented in the label.) The computation id just after the i-th step of the sweep is
correctly represented by an initially unknown such vertex. Identifying, for each i, precisely
which vertex is the representative one is the goal. Let this representative vertex be denoted
LAB(i). (The vertices are named by their labels.)

Under the hypothesis that the representative vertex just after the i-th step of the sweep
is u (LAB(i)=u), the label of the representative vertex v just after the (i+1)-th step of the
sweep (LAB(i+1)=v) can be determined from the transition function DELTA and the label
of u, together with the initial conditions of the sweep (i.e. the machine state, tape contents
and head positions at the beginning of the sweep and the tape modes during the sweep). In
the graph introduce an edge from u to v to represent this relationship, which we denote as a
function NEXT(u)=v. Since the machine is deterministic, the outdegree of each vertex is at
most 1.

NEXT(.) can be implemented efficiently as a subroutine. To see this, suppose
u=<i,b,a_1,...,a_j,...,a_h>. Knowing the initial conditions of the sweep, we can compute for
each tape j the last w locations of the head of tape j as LOC(i-w+1,mode_j),...,LOC(i,mode_j).
The symbols scanned at the last w steps are given by the elements of the w-tuple a_j∈Σ^w, 1≤j≤h,
in the label of u. Using the current state b and the symbols currently being scanned, the
subroutine DELTA returns the symbols to be written and the head movement if any on each
tape. The head movements specified by DELTA enable us to compute the new locations of
all the heads.

We claim that these new locations are either being accessed for the first time during the
sweep or are one of the last w locations computed above. (The cells left well behind cannot
be accessed again during the same sweep.) If the location is being accessed for the first time,
we obtain the new symbol under the head by looking up the global array. If it is one of
the last w locations, we determine the last time that location was accessed, recompute the
symbol that was written at that step, and take that as the new symbol under the head. Thus
we can construct the label of the next vertex and this is v.

As a bonus, we compare the head movements specified by DELTA with the known modes 
of the tapes, and this tells us whether a mode transition has occurred during this step. If it 
has, u is the last vertex of this sweep and v=NEXT(u) is the first vertex of the next sweep. 

The correct representative vertices LAB(i), 0≤i≤(w*s(n)), form a directed path that
represents the computation sequence. The vertex representing the start of the sweep can be
easily identified and is designated the start vertex s. The number of vertices is O(s(n)), and
the graph is acyclic and layered (edges go only from one layer to the next), indexed by the
first component of the vertex label. Thus there exists a unique longest path from the start
vertex s up to the vertex t representing that step of the sweep at which the machine either
halts or makes a mode transition. (In general, this may occur after i steps, for some i less
than or equal to the maximum possible length of the sweep.) To simulate the computation
of the machine, this path has to be found, i.e. each vertex LAB(0)=s,LAB(1),...,LAB(i)=t
on this path has to be constructed.

The setup stage procedure construct-graph may be described in pseudo-code as follows:

alloc PRED:global-array of vertex-labels indexed by vertex-labels;
procedure construct-graph
seq-begin

for(each layer i)par-begin

for(each vertex v of layer i)seq-begin

if(layer(v)==0)PRED(v)←{v}; else PRED(v)←{};

if(layer(v)>0)seq-begin

for(each vertex u of layer layer(v)-1)

if(NEXT(u)==v)PRED(v)←PRED(v)∪{u};

if(PRED(v)=={})PRED(v)←{adopter(layer i-1)};

seq-end
seq-end

par-end

/* LOOP INVARIANT MENTIONED IN THE TEXT HOLDS AT THIS POINT */
seq-end;

The path-finding procedure find-path is now presented as a multi-process algorithm. Note
that the graph is acyclic and layered. One process is activated for each vertex. (The label
of the vertex is itself the pid of the process and is referred to in the following pseudocode as
SELF.) Each process tries to find whether its vertex is connected to the start vertex s, and
it maintains a record in a global memory array PRED to represent the labels of the farthest
vertices to which it knows it is connected. Let the contents of record PRED of processor i be
denoted by the shorthand [i] (i.e. PRED(i) is synonymous with [i]). Let the PRED records of
processors allocated to vertices of the 0-th layer (prospective initial states of the sweep) hold
the value SELF (i.e. PRED(i)={i} for such i). Note that PRED(i) is a set of labels.

The initial contents of PRED(v) are all the immediate predecessors of vertex v, which can
be determined by trying each vertex u of the previous layer in turn (there are only a constant
number of such vertices u). It is desired to maintain the loop invariant that all the elements
in PRED(v) are from a single layer (and hence their number is bounded by a constant).
However, some vertices of layers other than the 0-th layer may be orphans; i.e. they may not
have predecessors in the sense given above. To handle these, introduce a chain of “adopter”
vertices and set the predecessor of orphans of the (i+1)-th layer to the “adopter” vertex of the
i-th layer. The “adopter” vertex of the 0-th layer is its own predecessor.

PRED is essentially the inverse of NEXT. However, because of the adopter vertices, the
graph defined by NEXT is a subgraph of the graph defined by PRED. The algorithm works
as follows:

For O(log((1+s(n))+(|Q|)(|Σ|^(wh))))=O(log(s(n))) (sequential) iterations, each process
executes the statement PRED←[[SELF]]; (i.e. process i executes PRED(i)←{k | ∃j (j∈[i] ∧ k∈[j])}).
The global array PRED is updated by synchronising all these processes at the end of each
iteration. At the end of this loop, the PRED record of a process will hold the label of the
start vertex s if and only if it is on the desired path. The array PRED is in global memory
and constitutes the sole representation of the graph under construction. Processes obtain
the label of the vertex they represent simply by parsing their own pid.

The path-finding stage procedure find-path may be described in pseudo-code as follows.

procedure find-path

seq-begin

for(log(Number-of-layers) iterations)seq-begin

/* LOOP INVARIANT MENTIONED IN THE TEXT HOLDS AT THIS POINT */

for(each vertex w)par-begin

TEMP(w)←{};

for(each vertex v∈PRED(w))for(each vertex u∈PRED(v))seq-begin

TEMP(w)←TEMP(w)∪{u};

seq-end

PRED(w)←TEMP(w);

par-end

seq-end

/* LOOP INVARIANT MENTIONED IN THE TEXT HOLDS AT THIS POINT */
seq-end;
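The setup and path-finding stages can be sketched together as a sequential Python emulation of the pointer-doubling idea. This is a simplified illustration, not the PRAM program: vertices are `(layer, label)` pairs, every layer shares one label set, the transition is passed in as a hypothetical function `next_label(i, a)` returning a label in layer i+1 or `None`, and `"adopter"` is a reserved label chosen here for the adopter chain.

```python
import math


def find_path_by_doubling(num_layers, labels, next_label, start_label):
    """Return the set of vertices (layer, label) on the path from (0, start_label)."""
    adopt = "adopter"  # assumed not to occur in `labels`
    pred = {}
    for i in range(num_layers):
        # The adopter of layer 0 is its own predecessor; others chain backwards.
        pred[(i, adopt)] = {(max(i - 1, 0), adopt)}
        for a in labels:
            pred[(i, a)] = {(0, a)} if i == 0 else set()
    # PRED is the inverse of NEXT ...
    for i in range(num_layers - 1):
        for a in labels:
            b = next_label(i, a)
            if b is not None:
                pred[(i + 1, b)].add((i, a))
    # ... except that orphans are handed to the adopter of the previous layer.
    for v in pred:
        if not pred[v]:
            pred[v] = {(v[0] - 1, adopt)}
    rounds = math.ceil(math.log2(max(num_layers, 2)))
    for _ in range(rounds):
        # One synchronous round: every vertex replaces PRED by PRED of PRED.
        pred = {v: set().union(*(pred[u] for u in pred[v])) for v in pred}
    start = (0, start_label)
    return {v for v, ps in pred.items() if start in ps and v[1] != adopt}
```

After j rounds every element of PRED(v) for a vertex v in layer i lies in layer max(i-2^j,0), so after about log2 of the number of layers each record holds only layer-0 vertices, and a vertex is on the path exactly when the start vertex is among them.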

Once the path is obtained, the array LAB can be used as a lookup table or function.
Now it is easy to obtain the new contents of all the tapes. For each location loc accessed
during the sweep, knowing mode_j, the mode of tape j, we can calculate the largest i such that
LOC(i,mode_j)=loc, which is the time of last access of that location. The vertex with label
LAB(i,mode_j)=<i,b,a_1,...,a_j,...,a_h> (say) represents the now known fact that just after the
i-th step of the sweep, the machine is in state b, relative to the beginning of the sweep the
location of the j-th tape head is LOC(i,mode_j)=loc and the most recent w symbols read/scanned
by the j-th tape head is a_j, the symbol currently being scanned being last(a_j). Then the
subroutine DELTA when called with arguments b,last(a_1),...,last(a_h) returns (among other
things) the final contents of the cell at location loc on tape j.

The commit stage procedure update-status may be described in pseudo-code as follows:

procedure update-status

seq-begin

for(each tape tapej)seq-begin

for(each location loc on tapej accessed during the sweep)par-begin

i←time of last access of location loc on tapej;

status←projection(LAB(i,modej));

newvalue←projection(DELTA(status));

tapej[headj+loc]←newvalue;

par-end;

headj←headj+(last location accessed on tapej);
modej←last w head movements of tapej;
seq-end;

state←DTM state at end of sweep;
seq-end;

It has already been shown in the proofs of previous lemmas that primitives like computa- 
tion of head movements can be accomplished in time polylog(s(n)-|-n) by a single processor. 
It remains to be shown how the work in the algorithm described above is distributed among 
the processors so as to achieve the required resource bounds. 

A naive implementation of the algorithm described above would use one processor per
process, O(log(s(n))/log(n)) global memory words for temporaries per processor (needed in
general since s(n)=Ω(n); may be omitted if s(n)=O(poly(n))), and O(s(n)log(s(n))/log(n))
global memory words for common data, hence O(s(n)) processors and O(s(n)log(s(n))/log(n))
global memory words in all and t(n)=poly(x(n)+log(s(n))+log(n)) parallel time. Since
log(s(n))=O(poly(x(n)+log(n))), we have t(n)=poly(x(n)+log(n)).

Since the number of bits per register/global memory word is O(log(n)), this means
O(s(n)log(s(n))) hardware, which is too much. By using only O(s(n)/log(s(n))) processors
and letting each of them (sequentially) simulate log(s(n)) processes of the algorithm above,
the amount of hardware for processors and global memory words used for temporaries can be
reduced to O(s(n)) at the cost of a poly(x(n)+log(n)) factor increase in parallel time (since
log(s(n))=O(poly(x(n)+log(n)))).

However, the memory requirement for common data is still too large. To reduce this, the
PRED records have to be implemented using only O(1) bits. This can be done by taking
advantage of the layered nature of the graph. The introduction of the adopter vertices ensures
that the following loop invariant holds:

LOOP INVARIANT: The layer number of each of the elements in the PRED record of a
vertex of the i-th layer after j iterations of the loop is the same and depends only on i and j
(it is equal to max(i-2^j,0)).

This can be recomputed in each iteration. Hence it suffices to omit the first component
of the label of the vertices, which gives their layer, and store only the remaining components
of the label (i.e. <b,a_1,...,a_j,...,a_h>) to preserve all the information of record PRED. In
each layer, there are only a finite number of vertices. Thus O(log(|Q|*|Σ|^(wh)))=O(1) bits
per vertex are enough and the memory requirement reduces to O(s(n)+log(n)) bits. When
the processing for a vertex is being simulated, the O(1) bits corresponding to that vertex are
uploaded into the global memory words used for temporaries by the simulating processor
for calculations, then downloaded after the simulation of that vertex is through. So the
number of bits in global memory words used for temporaries by the simulating processors is
O(s(n)/log(s(n)))+O(log(s(n))/log(n))+O(log(n))=O(s(n)), which is acceptable.
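
The loop invariant can be checked on a toy model. The sketch below is an illustration under assumptions: it tracks only layer numbers, starts with every layer pointing to the previous one, and lets layer 0 point to itself (standing in for the extra vertices that keep the invariant well defined). It is not the simulation itself.

```python
def pointer_doubling_layers(num_layers):
    """Return, for j = 0, 1, 2, ..., the layer that the PRED record of each
    layer i points into after j doubling iterations."""
    # Initially the PRED record of layer i refers to layer i-1 (layer 0 to itself).
    pred = [max(i - 1, 0) for i in range(num_layers)]
    history = [pred[:]]
    while any(p != 0 for p in pred):
        pred = [pred[p] for p in pred]    # every vertex jumps at once
        history.append(pred[:])
    return history


history = pointer_doubling_layers(9)
```

Since the target layer is a function of i and j alone, it never needs to be stored with the vertex, which is what allows the layer component of the label to be dropped.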

■

Because the graph is acyclic, layered and bounded-width, only O(1) hardware per vertex
(O(s(n)+log(n)) for the whole graph) rather than O(1) hardware per edge (O((s(n)+log(n))^2)
for the whole graph) is needed to achieve poly(x(n)+log(n)) parallel time. Thus a processor
blowup similar to that faced by Pippenger [Pippenger1979] is avoided. The importance of
the notion of cycle-time in the definition of mode transitions will become evident in Chapters
3 and 4, where it will be shown that a polylog(n) bound on mode transitions, though more
restrictive than a corresponding bound on reversals, is still enough to obtain an analogue of
Pippenger’s simulation result [Pippenger1979] for reversals, given a polynomial blowup in the
workspace.



Chapter 3 


The Calculus FO+Y(T,S,V) 


In this chapter, first the calculus FO+Y(T,S,V) extending first order logic is defined. An 
interpretation of syntactic restrictions on formulas as resource bounds on an abstract model of 
computation is given. Then the expression of PRAM computations by syntactically restricted 
formulas is presented. The implementation of the calculus on the DTM model is covered in 
chapter 4. 

The work reported in this chapter is an extension of Immerman’s work on the expression of 
parallel computations and focuses on separating the representation of time instants from the 
representation of space locations. The price paid for this is the predominant operationality 
of our approach which eliminates the possibility of our calculus being used directly as a 
programming language. However, the purpose of this work is to prove a complexity theoretic
characterisation result and we feel the use of a descriptive approach has greatly eased the
task of discovering as well as presenting the proof.


3.1 Definition of the Calculus 

Structures shall have three disjoint finite domains which shall be known as the linear domain
LIN, the logarithmic domain LOG and the binary domain BIN. The numbers of elements are
|LIN|=n, |LOG|=⌈log2 n⌉ and |BIN|=2. It is assumed that n>4.

Over each domain, there shall be a predefined successor function SUC that imposes a total
ordering on the domain. There shall be a predefined binary equality predicate EQ over each
domain. Strict typing shall be enforced and ambiguity can be resolved from the context.

W.r.t. SUC, the domains may be thought of as initial subsets of the natural numbers.
The smallest element of a domain D∈{LIN,LOG,BIN} is the predefined constant D1 and the
largest element of D is the predefined constant D∞. Ambiguity can be resolved from the
context. SUC(D∞) shall be D∞ itself.

Structures shall have a predefined doubling function DOUBLE over each of the domains.
DOUBLE(x)=if((x+x)<D∞)then(x+x)else(D∞).

Structures shall have predefined functions FROMLOG:LOG→LIN and TOLOG:LIN→LOG.
FROMLOG(x)=x and TOLOG(x)=min(x,⌈log2 n⌉).

Structures shall have predefined functions FROMBIN:BIN→LIN and TOBIN:LIN→BIN.
FROMBIN(x)=x and TOBIN(x)=min(x,2).

In all the above, we use the interpretation of domain elements as numbers. Thus:

FROMLOG(LOG1)=LIN1;
TOLOG(LIN1)=LOG1;
TOLOG(LIN∞)=LOG∞;
TOBIN(LIN1)=BIN1;
FROMBIN(BIN1)=LIN1;
TOBIN(LIN∞)=BIN∞;
FROMBIN(BIN∞)=SUC(LIN1).
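
The predefined functions can be modelled directly on integers. The following is a minimal sketch, assuming that each domain is represented as the integers 1..|D| (so that D1 is 1 and the largest element is |D|); the class name `Structure` and its method names are hypothetical and chosen only for this illustration.

```python
class Structure:
    """Integer model of the domains LIN, LOG and BIN for input size n > 4."""

    def __init__(self, n):
        if n <= 4:
            raise ValueError("the text assumes n > 4")
        # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
        self.size = {"LIN": n, "LOG": (n - 1).bit_length(), "BIN": 2}

    def first(self, d):
        return 1                         # the constant D1

    def last(self, d):
        return self.size[d]              # the constant "D infinity"

    def suc(self, d, x):
        return min(x + 1, self.size[d])  # SUC saturates at the largest element

    def double(self, d, x):
        return min(x + x, self.size[d])  # DOUBLE saturates as well

    def tolog(self, x):
        return min(x, self.size["LOG"])

    def fromlog(self, x):
        return x

    def tobin(self, x):
        return min(x, 2)

    def frombin(self, x):
        return x


st = Structure(20)
```

With n=20 the LOG domain has 5 elements, so `st.tolog(st.last("LIN"))` returns 5, matching TOLOG of the largest LIN element being the largest LOG element.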

Structures shall have a unary predicate INP over LIN which shall not be predefined and 
which shall represent a binary “input” string of length n, or the extensional database. 

The logic FO consists of all formulas built up in the usual way from the typed symbols
SUC, DOUBLE, TOLOG, FROMLOG, TOBIN, FROMBIN, EQ and INP together with
logical connectives (∧, ∨, ¬), typed quantifiers (∀, ∃), typed domain variables (x, y, z, t, s,
ti, si, ta, sa, tb, ...) and typed predicate variables (P, Q, R, ...). The predicate variables will
be permitted only in suitable extensions of FO.

Concatenation of tuples and sequences shall be indicated by the symbol * which also
denotes multiplication. When the same domain element is to be repeated in succession in a
tuple, the number of consecutive occurrences shall be indicated as an exponent.

For a Time Domain T∈{LIN,LOG}, a Space Domain S∈{LIN,LOG}, and a Variance
Domain V∈{LOG,BIN : (V∞<S∞)∧(V∞<T∞)}, the calculus FO+Y(T,S,V) is obtained by
augmenting FO with the following additional formation rule:

Let c,k,q≥1 be any integers, let φ be a formula of the calculus, let P^(q+k) be a (q+k)-ary
predicate variable over V^q S^k, let t be a c-tuple of distinct variables over T, s a k-tuple of
distinct variables over S and r a q-tuple of distinct variables over V, and let x be a (q+k)-tuple
of (not necessarily distinct) constants, variables and compound terms over V^q S^k. Let
the symbol Φ denote the formula Y(T,S,V)[P,t,r,s](φ)[x].

Then Φ is a formula of the calculus. All free occurrences of the variables of t, r and s
and the predicate variable P in φ are bound in Φ by the operator Y. The free variables of
Φ are the remaining freely occurring variables of φ, as well as the variables of x. (This also
covers the free predicate variables.) The calculus is closed under the usual FO connectives
and quantifiers, as well as nesting of the operator Y in φ. The Y-operator constructs an
interpretation, the Y-interpretation, of the predicate variable P, and Φ holds iff P(x) holds
in the Y-interpretation (in the context of an interpretation of the free symbols). Y computes
the Y-interpretation of P by the following pseudo-algorithm:

begin
    P <- {}; /* Empty Set */
    for t <- T1^c to T∞^c do /* Iteration by lexicographic ordering with t∈T^c */
        P <- { r*s | r*s=<r1,r2,...,rq,s1,s2,...,sk> ∧
               ∀1≤j≤q[rj∈V] ∧ ∀1≤j≤k[sj∈S] ∧
               φ(P,t,r,s) holds };
        /* Non-monotonic update in parallel for all tuples r*s */
end.

If the enclosed formula φ contains an occurrence of the Y-operator, then the non-monotonic
update in parallel mentioned above requires a different Y-interpretation of this enclosed
Y-operator for each binding of values to variables by the outer Y-operator. In other words, the
enclosed Y-operator is invoked in parallel for each binding of r and s.

Occurrences of Y that are nested in the scope of another occurrence of Y are invoked
over and over again at every stage (iteration) in the evaluation of the outer operator. The
vocabulary of the expressions in the scope of the Y operator can vary from occurrence to
occurrence, because of predicate variables bound outside its scope, though the symbols SUC,
DOUBLE, TOLOG, FROMLOG, TOBIN, FROMBIN, EQ and INP are common. However,
for each occurrence, the vocabulary is fixed, and the semantics of Y is defined in terms of the
invocation time interpretations of the symbols of that vocabulary. This defines the semantics
of Y.
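
The pseudo-algorithm for a single (non-nested) Y-operator can be sketched as below. This is an illustration under assumptions: domains are modelled as the integers 1..size, the enclosed formula is passed as a Python function `phi(P, t, r, s)`, and the name `eval_Y` is hypothetical. `itertools.product` enumerates tuples with the first component most significant, which gives the lexicographic order on time tuples.

```python
from itertools import product


def eval_Y(t_size, s_size, v_size, c, k, q, phi):
    """Return the Y-interpretation of P as a set of pairs (r, s).

    t_size, s_size, v_size: sizes of the time, space and variance domains.
    c, k, q: time, space and variance arities.
    phi(P, t, r, s): truth value of the enclosed formula, where P is the
    interpretation computed in the previous iteration.
    """
    P = frozenset()                                    # P <- {}
    for t in product(range(1, t_size + 1), repeat=c):  # lexicographic order
        # The new P is built entirely from the old P, so the update is
        # non-monotonic and "in parallel" over all tuples r*s.
        P = frozenset(
            (r, s)
            for r in product(range(1, v_size + 1), repeat=q)
            for s in product(range(1, s_size + 1), repeat=k)
            if phi(P, t, r, s)
        )
    return P


# Example 1: P grows by one element per iteration (q = 0, so r is empty).
grow = lambda P, t, r, s: s == (1,) or ((), (s[0] - 1,)) in P
grown = eval_Y(3, 5, 2, 1, 1, 0, grow)

# Example 2: a non-monotonic formula that complements P in every iteration.
toggle = lambda P, t, r, s: (r, s) not in P
toggled = eval_Y(3, 2, 2, 1, 1, 0, toggle)
```

The second example shows why the update is called non-monotonic: tuples may leave P as well as enter it from one iteration to the next.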

Formulas of FO+Y(T,S,V) which do not contain any unbound variables can be seen to
define a global predicate on finite binary strings. Thus each formula corresponds to a set of
strings and the calculus expresses a class. The same descriptor shall be used to refer to the 
calculus and the class it expresses. Ambiguity can be resolved from the context. 

The notion of “Variance Domain” introduced here generalizes simultaneous induction such
that the number of predicates being simultaneously defined becomes a slowly growing function 
of the input size. 


3.2 Normal Form Results for FO+Y(T,S,V)

Definition 3.2.1: 

The constants c, k and q (specified in the definition of the calculus) are re- 
ferred to as the time arity, the space arity and the variance arity respectively 
of the Y-operator. The time arity (respectively space arity, variance arity) of a 
formula is defined to be equal to the largest among the time arities (respectively 
space arities, variance arities) of the Y-operators that occur in the formula. The 
nesting-depth of a formula is defined to be equal to the length of the longest
chain of occurrences of Y-operators in the formula such that each successive
occurrence in the chain is in the scope of the previous one. A formula in prenex form
consists of a propositional formula in the scope of a nested sequence of zero or more first
order quantifiers over the linear domain LIN and Y-operators, where no Y-operator is nested
immediately inside the scope of another Y-operator.

W.l.o.g. we assume that disjunctions do not occur in formulas. 

Definition 3.2.2: 

The height of each negation or conjunction in a formula is the nesting depth 
of the subformula in its scope. The height of a Y-operator is 1 if there is another 
Y-operator immediately enclosed inside its scope and 0 otherwise. The height of 
a formula is the largest among the heights of all its negations, conjunctions and 
Y-operators. Note that the height of a formula in prenex form is 0. W.l.o.g., if
φ is a formula of height greater than 0, there is a unique (modulo trivial first
order simplifications) leftmost innermost negation/conjunction/Y-operator in φ
whose height is equal to the height of φ and which immediately encloses a
Y-operator. The formula consisting of this negation/conjunction/Y-operator and
the formula(s) inside its scope is said to be the redex of φ.

Claim 3.2.3: 

Let φ be a formula not in prenex form. The height of the redex of φ is equal to the height
of φ. The height of any subformula inside the scope of the outermost operator of the redex
of φ is strictly less than the height of φ.

Lemma 3.2.4: 

Given a formula of FO+Y(T,S,V), one can construct an equivalent formula in prenex
form with the following properties: (1) The nesting depth does not increase. (2) The number
of Y-operators does not increase. (3) The number of first order quantifiers does not increase.

Proof: 

By structural induction. Essentially it is shown that propositional operators can be
migrated inward through Y-operators. In addition, if a Y-operator is nested immediately inside
the scope of another Y-operator, they can be replaced by a single Y-operator.

Let R be a reduction function which, given as argument a formula φ not in prenex form,
constructs formula R(φ) by replacing the redex of φ with a formula which is equivalent to
the redex of φ and whose height is strictly less than the height of the redex of φ, without
increasing the nesting depth, the number of Y-operators or the number of first-order
quantifiers. Then by repeated applications of R and trivial first order simplifications, one can
construct from formula φ a formula Φ which is equivalent to φ, which is in prenex form and
which satisfies the properties listed in the statement of the lemma. W.l.o.g. the redex of a
formula φ not in prenex form is in one of the following forms:

1. ¬(Y(T,S,V)[P,t,r,s](G)[x])

2. F ∧ Y(T,S,V)[P,t,r,s](G)[x]

3. Y(T,S,V)[P,t,r,s](H)[y] ∧ Y(T,S,V)[P,t,r,s](G)[x]

4. Y(T,S,V)[P,ta,ra,sa](Y(T,S,V)[Q,tb,rb,sb](G)[x])[y]

where F, G and H are formulas and the outermost operator of F is not a Y-operator. It 
is sufficient to handle these forms to obtain a suitable reduction function R. 

Let a redex be in form 1. Increasing the arity of the time domain from c to c+1 (thus
introducing a new time variable ta), construct a single formula Y(T,S,V)[P,ta*t,r,s](I)[x]
with the following formula I:

[ta=T1 ∧ G] ∨ [ta*t=T∞^(c+1) ∧ ¬P(r*s)] ∨
[ta≠T1 ∧ ta*t≠T∞^(c+1) ∧ P(r*s)]

The resulting formula is equivalent to the redex and its height is less by 1.
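
The construction for form 1 can be tried out on a small instance. The sketch below is an illustration under assumptions: it fixes time arity c=1, space arity 1 and ignores the variance component, models domains as integers 1..size, and uses a hypothetical example formula G. While ta is the first element the inner formula G runs; afterwards P is preserved until the very last time tuple, where it is complemented.

```python
from itertools import product


def eval_Y(t_size, s_size, c, phi):
    """Y-interpretation of P for space arity 1, with the variance part omitted."""
    P = frozenset()
    for t in product(range(1, t_size + 1), repeat=c):  # lexicographic order
        P = frozenset(s for s in range(1, s_size + 1) if phi(P, t, s))
    return P


def negate(G, t_size):
    """Build the formula I of form 1 for an inner formula G of time arity 1."""
    top = (t_size, t_size)                             # the last time tuple
    def I(P, tt, s):
        ta, t = tt
        return ((ta == 1 and G(P, (t,), s))
                or (tt == top and s not in P)
                or (ta != 1 and tt != top and s in P))
    return I


# Hypothetical inner formula: s = 1 holds, and s holds whenever s-2 held before.
G = lambda P, t, s: s == 1 or (s - 2) in P

inner = eval_Y(3, 6, 1, G)
outer = eval_Y(3, 6, 2, negate(G, 3))
```

Querying `outer` on a tuple then gives the negation of querying `inner` on the same tuple, at the cost of one extra time variable.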

Let a redex be in form 2. Rename the variables P, t, r and s bound by the Y-operator
to avoid conflicts. This changes G to G' (say). Increasing the arity of the time domain
from c to c+1 (thus introducing a new time variable ta), construct a single formula
Y(T,S,V)[P,ta*t,r,s](I)[x] with the following formula I (note that the P, t, r and s bound by
the Y-operator in the formula under construction represent the renamed variables):

[ta=T1 ∧ G'] ∨ [ta≠T1 ∧ F ∧ P(r*s)].

The resulting formula is equivalent to the redex and its height is less by 1.

Let a redex be in form 3. A time domain arity of c=max(c1,c2)+1, a space domain
arity of k=max(k1,k2) and a variance domain arity of q=max(q1,q2)+1 are used. By
introducing conjuncts tying down some of the bound variables to fixed values, one can
ensure that computation progresses only for the required time. The approach is to
compute the two Y-interpretations in separate workspaces, query these on y and x, and store
the conjunction of the truth values so obtained in a third workspace. These workspaces
are discriminated by the most significant bound variable of r. Construct a single formula
Y(T,S,V)[P,ta*t,ra*r,s](I)[S∞^(q+k)] with the following formula I (ta and t are 1- and
(c-1)-tuples of variables over T, ra and r are 1- and (q-1)-tuples of variables over V and s is a
k-tuple of variables over S):

[ta=T1 ∧ ([ra=V1 ∧ H'(t,r,s)] ∨ [ra=SUC(V1) ∧ G'(t,r,s)])] ∨
[ta*t=T∞^c ∧ P(V1*y') ∧ P(SUC(V1)*x')] ∨
[ta≠T1 ∧ ta*t≠T∞^c ∧ P(ra*r*s)]

where H' and G' are obtained from H and G by introducing conjuncts that tie down unused
variables to D1, and y' and x' are obtained from y and x by prefixing enough occurrences of
D1 to match the required arity. This resulting formula is equivalent to the original redex and
its height is less by 1.

Let a redex be in form 4. A time domain arity of c=c1+c2+2, a space domain arity of
k=k1+k2 and a variance domain arity of q=q1+q2+1 are used. By introducing conjuncts
tying down some of the bound variables to fixed values, one can ensure that computation
progresses only for the required time. The approach is to store the Y-interpretations in
separate workspaces during the nested computations. There is one outer Y-interpretation
and V∞^(q1)*S∞^(k1) inner Y-interpretations, one for each binding of the variables of ra and sa.
The inner Y-interpretations are evaluated on x in every stage of the computation of the
outer Y-interpretation. When the outer Y-interpretation has been computed, it is evaluated
on y and the truth value obtained is stored in a third workspace. These workspaces are
discriminated by the most significant bound variable of the variance domain.

Let x=xv*xs and y=yv*ys where xv and yv are q2- and q1-tuples of terms over V and xs
and ys are k2- and k1-tuples of terms over S. Construct a single formula Y(T,S,V)[R,tc*ta*td*tb,r
with the following formula I (tc, ta, td and tb are 1-, c1-, 1- and c2-tuples of variables over
T, rc, ra and rb are 1-, q1- and q2-tuples of variables over V and sa and sb are k1- and
k2-tuples of variables over S):

[tc=T1 ∧ td=T1 ∧ ([rc=V1 ∧ R(rc*ra*rb*sa*sb)] ∨ [rc=SUC(V1) ∧ G'])] ∨
[tc=T1 ∧ td*tb=SUC(T1)^(c2+1) ∧ (rc=V1 ∧ R[SUC(V1)*ra*xv*sa*xs])] ∨
[tc=T1 ∧ td≠T1 ∧ td*tb≠SUC(T1)^(c2+1) ∧ R(rc*ra*rb*sa*sb)] ∨
[tc*ta*td*tb=SUC(T1)^(c1+c2+2) ∧ R(V1*yv*rb*ys*sb)] ∨
[tc≠T1 ∧ tc*ta*td*tb≠SUC(T1)^(c1+c2+2) ∧ R(rc*ra*rb*sa*sb)]

where G' is obtained from G by introducing conjuncts that tie down unused variables to
D1. This resulting formula is equivalent to the original redex and its height is less by 1. The
principle here is that all “computation” takes place while tc=T1 (the first three disjuncts),
except for the test of the outer predicate on the tuple y, which is done when the variables
of tc, ta, td and tb are all equal to SUC(T1) (the fourth disjunct). The last (fifth) disjunct
says that at all other times the status quo is preserved. The first three disjuncts follow a
similar principle for the computation of the inner Y-operator.

These constructions together give a reduction function R. By induction alternately on the
number of negations/conjunctions whose height is equal to the height of the formula while
there is at least one such negation/conjunction and then on the number of Y-operators whose
height is non-zero, it follows that each formula is constructively equivalent to some formula
in prenex form. It suffices to note that the current induction variable strictly decreases in
each step and induction variables of yet-to-come steps are not increased by more than a finite
amount in the current step. This completes the proof of the lemma.

■

Definition 3.2.5: 

A formula is said to be in collapsed form if it consists of a propositional formula 
in the scope of at most a single Y-operator. 

Note that a formula in collapsed form has no first order quantifiers in it.

Lemma 3.2.6:

The formulas of the calculi FO+Y(LIN,LIN,V), FO+Y(LOG,LIN,V) and FO+Y(LIN,LOG,BIN), where
V∈{LOG,BIN}, have a collapsed form, i.e. given a formula of one of these calculi, one can
construct an equivalent formula in collapsed form.

Proof: 

Note that the algorithm in the proof of the previous lemma which converts a formula into 
prenex form does not ever increase the number of quantifiers or Y-operators or the nesting 
depth. Thus if a way were to be found of simulating quantifiers over the linear domain LIN 
using Y-operators, negations and conjunctions, it would be possible to attain a collapsed 
form. (Nonessential use of disjunctions is made below for clarity.) 

The following formula gives such a simulation when the space domain is LIN, i.e. in the
case of FO+Y(LIN,LIN,V) and FO+Y(LOG,LIN,V) (V∈{BIN,LOG}):
Y(T,S,V)[P,t,r,s](F')[S1] simulates ∀sF(s) for some formula F with F' as
[t=T1 ∧ F(s)] ∨ [t≠T1 ∧ t≠T∞ ∧ P(s) ∧ P(DOUBLE(s)) ∧ P(SUC(DOUBLE(s)))] ∨
[t=T∞ ∧ P(S1) ∧ P(SUC(S1)) ∧ P(SUC(SUC(S1))) ∧ P(SUC(SUC(SUC(S1))))]

The following formula gives such a simulation when the time domain is LIN, i.e. in the
case of FO+Y(LIN,LIN,V) and FO+Y(LIN,LOG,BIN) (V∈{BIN,LOG}):
Y(T,S,V)[P,t,r,s](F')[S1] simulates ∀tF(t) for some formula F with F' as
[t=T1 ∧ F(t)] ∨ [t≠T1 ∧ P(S1) ∧ F(t)]

Existential quantifiers are handled similarly. The collapsed form is obtained by first suc- 
cessively simulating all occurrences of first-order quantifiers in the manner described, and 
then taking the prenex form as per the previous lemma. Since there are no first-order quan- 
tifiers left and none are added while taking the prenex form, and since immediately nested 
Y-operators are not allowed in the prenex form, it follows that the formula must be in col- 
lapsed form. 

■
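
The first simulation in the proof above (a universal quantifier over a linear space domain) behaves like a conjunction computed along a binary tree whose children are reached through DOUBLE and SUC. The following sketch illustrates it under assumptions: the time domain is taken to be LOG with a single time variable, domains are the integers 1..size, and the function name `forall_via_Y` is hypothetical.

```python
def forall_via_Y(F, n):
    """Decide "for all s in LIN: F(s)" by iterating the formula F'.

    Assumes n > 4, time domain LOG (ceil(log2 n) iterations), space domain LIN.
    """
    t_last = (n - 1).bit_length()          # largest element of LOG
    suc = lambda x: min(x + 1, n)          # saturating successor on LIN
    double = lambda x: min(2 * x, n)       # saturating doubling on LIN
    first_four = (1, suc(1), suc(suc(1)), suc(suc(suc(1))))

    P = frozenset()
    for t in range(1, t_last + 1):
        P = frozenset(
            s for s in range(1, n + 1)
            if (t == 1 and F(s))
            or (t != 1 and t != t_last
                and s in P and double(s) in P and suc(double(s)) in P)
            or (t == t_last and all(x in P for x in first_four))
        )
    return 1 in P                          # the query tuple is S1


all_true = forall_via_Y(lambda s: True, 20)
one_false = forall_via_Y(lambda s: s != 13, 20)
```

Each middle iteration lets an element absorb the values of its two children, so after about log2(n) iterations the first four elements together account for every element of the domain.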

It follows that formulas of FO+Y(LIN,LIN,V), FO+Y(LOG,LIN,V) and FO+Y(LIN,LOG,V)
(V∈{BIN,LOG}) have a collapsed form. We shall call these the collapsible calculi.

Corollary 3.2.7: 

Restricting the nesting depth to 1 does not affect the expressive power of collapsible calculi if
the time domain arity, the variance domain arity and the space domain arity are unrestricted.

Lemma 3.2.8: 

Restricting the time domain arity to 1 does not affect the expressive power of
FO+Y(T,S,V) when nesting is permitted.

Proof: 

Let Y[P,t,r,s](F)[x] be some formula with time-arity c, variance arity q and space-arity
k. Let P1, P2, ..., Pc be c new predicate variables over V^q S^k. Then the following formula is
equivalent to the given formula and its time-arity is restricted to 1:
Y[P1,t1,r,s](Y[P2,t2,r,s](...(Y[Pc,tc,r,s](F')[r*s])...)[r*s])[x]
where F' is

[(t2=T1 ∧ t3=T1 ∧ ... ∧ tc=T1 ∧ F[P1/P]) ∨
 (¬(t2=T1) ∧ t3=T1 ∧ ... ∧ tc=T1 ∧ F[P2/P]) ∨
 (¬(t3=T1) ∧ t4=T1 ∧ ... ∧ tc=T1 ∧ F[P3/P]) ∨
 ( ... ) ∨ (¬(tc=T1) ∧ F[Pc/P])
]

and F[Q/P] denotes the replacement of all unbound occurrences of predicate symbol P in
formula F with the predicate symbol Q.

t1, t2, etc. are the variables of the c-tuple t. No renaming of domain variables is necessary.
The principle here is that the last stored value of the predicate P is at the innermost
Y-operator which is not on the first iteration of the current invocation. The only exception is
the first step of the entire computation, when we use the P1 <- {} (Empty Set) initialisation
to start up. Note that the space arity is not affected.

■

Corollary 3.2.9: 

Every formula of a collapsible calculus has a form which is a propositional formula in the
scope of a sequence of immediately nested time-monadic Y-operators.

In the rest of this study, only the collapsible calculi are considered. 

The following definition and lemma, of technical interest, are used in Chapter 4:

Definition 3.2.10: 

A formula Y(T,S,V)[P,t,r,s](F)[w] is said to be in standard form if it is in 
collapsed form with the additional property that there is no occurrence of the 
functions SUC and DOUBLE over the LOG and BIN domains in the formula and 
in any (q+k)-tuple x which is an argument of an occurrence of P in F, the first 
q terms of the tuple (which in general may be constructed from any constant or 
variable and the predefined functions, but which take values from the variance 
domain) do not contain any occurrence of the variables of s. 

Lemma 3.2.11: 

Given any formula Y(T,S,V)[P,t,r,s](F)[w] in collapsed form, with time arity c, space arity
k and variance arity q, one can construct an equivalent formula Y(T,S,V)[P,t',r,s](F')[w]
in standard form, with time arity c', space arity k (unchanged) and variance arity q
(unchanged).

Proof: 

Occurrences of the functions SUC and DOUBLE over the domains LOG and BIN are handled
by functional composition from those over the domain LIN, together with the functions
FTLOG(.)=FROMLOG(TOLOG(.)) and FTBIN(.)=FROMBIN(TOBIN(.)). This is done
using the following identities repeatedly as (left-to-right) rewrite rules:
TOLOG(FROMLOG(x))=x
TOBIN(FROMBIN(x))=x
TOBIN(FROMLOG(TOLOG(x)))=TOBIN(x)
FROMLOG(TOLOG(FROMBIN(x)))=FROMBIN(x)
SUC_LOG(x)=TOLOG(SUC_LIN(FROMLOG(x)))
SUC_BIN(x)=TOBIN(SUC_LIN(FROMBIN(x)))
DOUBLE_LOG(x)=TOLOG(DOUBLE_LIN(FROMLOG(x)))
DOUBLE_BIN(x)=TOBIN(DOUBLE_LIN(FROMBIN(x)))

where the subscript of SUC and DOUBLE is used here to indicate the domain of these
strictly typed functions. The identities enable SUC and DOUBLE to be simulated over
the LIN domain by standard algorithms given in the next chapter. When the innermost
constant or variable of a term belongs to domain LIN, the typing rules ensure that the
resulting term is expressed in terms of SUC and DOUBLE over the domain LIN together with
FTLOG(.)=FROMLOG(TOLOG(.)) and FTBIN(.)=FROMBIN(TOBIN(.)), for which
standard simulation algorithms are given in the next chapter.
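
The rewrite identities can be checked exhaustively for a given n. The sketch below is an illustration that assumes the integer model of the domains (elements 1..|D|, with SUC and DOUBLE saturating at the largest element); the function name `check_identities` is hypothetical.

```python
def check_identities(n):
    """Check the eight rewrite identities for every admissible x; n > 4 assumed."""
    log_size = (n - 1).bit_length()                # ceil(log2 n)
    size = {"LIN": n, "LOG": log_size, "BIN": 2}
    suc = lambda d, x: min(x + 1, size[d])
    dbl = lambda d, x: min(2 * x, size[d])
    tolog = lambda x: min(x, log_size)
    fromlog = lambda x: x
    tobin = lambda x: min(x, 2)
    frombin = lambda x: x

    for x in range(1, log_size + 1):               # x ranges over LOG
        assert tolog(fromlog(x)) == x
        assert suc("LOG", x) == tolog(suc("LIN", fromlog(x)))
        assert dbl("LOG", x) == tolog(dbl("LIN", fromlog(x)))
    for x in (1, 2):                               # x ranges over BIN
        assert tobin(frombin(x)) == x
        assert fromlog(tolog(frombin(x))) == frombin(x)
        assert suc("BIN", x) == tobin(suc("LIN", frombin(x)))
        assert dbl("BIN", x) == tobin(dbl("LIN", frombin(x)))
    for x in range(1, n + 1):                      # x ranges over LIN
        assert tobin(fromlog(tolog(x))) == tobin(x)
    return True


ok = all(check_identities(n) for n in range(5, 200))
```

The identities hold because the LOG and BIN domains are initial segments of LIN, so saturating in LIN first and then truncating gives the same value as saturating in the smaller domain directly.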

Each occurrence of a space domain variable that violates the second requirement is
eliminated in succession at the expense of increasing the time arity by 1. The calculus permits
F.O. quantifiers over the variance domain. If a variance domain term y containing a space
domain variable is present in the argument x of an occurrence P(x) of the predicate variable
P in the formula, replace the occurrence P(x) by the expression ∃r0[P(x[r0/y]) ∧ EQ(r0,y)], for
some new variance domain variable r0. The existential quantifier is then migrated outward
to get a prenex form Y(T,S,V)[P,t,r,s](∃r0 F'')[w].

Finally the quantifier is simulated as follows (the method of the previous lemmas cannot
be used since it reintroduces a space domain variable in the variance domain term): The time
arity of the Y-operator is increased by 1, adding a new (least significant) time domain variable
t0. Thus t'=t*t0 and c'=c+1. Let TIM2VAR denote the appropriate predefined function
taking time domain arguments to variance domain values (T∞>V∞ by assumption). Let F'
denote (P(r*s) ∨ F''[TIM2VAR(t0)/r0]). The formula Y(T,S,V)[P,t',r,s](F')[w] is equivalent
to the original formula and satisfies the requirements of the lemma.

■

Theorem 3.2.12: 

Given any formula of the calculus FO+Y(T,S,V) (such that T and S are not both LOG),
one can construct an equivalent formula in standard form.

Proof: 

Follows from the previous results. 


3.3 Logical Expression of PRAM Computations

In chapter 2, the sequential and parallel models of computation and the resources of interest
were defined. Then the resource bounded simulation of the DTM model on the PRAM model
was presented. In the previous section, the calculus FO+Y(T,S,V) extending first order logic
was defined and a standard form was demonstrated. In this section, the simulation of the
PRAM model in the calculus is presented. The simulation of the logic on the DTM model is
presented in chapter 4.

We use the techniques of descriptive complexity research to structure the proof of our
result simulating PRAMs on DTMs. While it appears to us very difficult to present the
direct simulation of the PRAM on the DTM, the collapsible calculi FO+Y(T,S,V) used as a
via media turn out to be both powerful enough to express PRAM computations and
(syntactically) simple enough to permit easily presented simulations on the DTM. Thus we
can use the Y-operator indiscriminately while writing a formula expressing the PRAM
computation, and then construct the equivalent formula in standard form before simulating it on
the DTM. The standard form gives a regular structure to the “global memory access pattern”
of PRAM algorithms and the normalised algorithm represented by the expression in standard
form can be implemented efficiently in terms of mode transitions on the DTM, with a
polynomial blowup in the workspace. The proof is a variant of Immerman’s proof that the class
CRAM[t(n)]-PROC[O(n^k)] is included in the class IND[t(n)]-VAR[2k+2] [Immerman 1987b].

For the purposes of this section, conventional hardware bounds h(n) and conventional
parallel time bounds t(n) are those of the form O(log(n)^k), O(n^k) and O*(n^k) (for some constant
k≥1), with the restriction that either t(n)=Ω(n) and h(n)=O(poly(t(n))) or h(n)=Ω(n) and
log(h(n))=O(poly(t(n)+log(n))).

Each processor has a finite set of registers including the PROCESSOR, ADDRESS,
CONTENTS and PROGRAM-COUNTER registers.

Claim 3.3.1:

Let h(n) be a conventional hardware bound and let t(n) be a conventional parallel time
bound. Then, for suitable choice of domains T, S and V, and corresponding domain arities
c, k and q respectively, the following hold:

(1) Time instants can be represented as elements of T^c.

(2) Addresses of bits in processor registers and global memory can be represented as
elements of V^q S^k.

(3) T∞^c=O(poly(t(n)+log(n))).

(4) V∞^q*S∞^k=O(h(n)).

Theorem 3.3.2: 

For T,S∈{LIN,LOG} (such that T and S are not both LOG) and
V∈{LOG,BIN : V∞<S∞ ∧ V∞<T∞}, let A be a CRCW
PRAM decision algorithm written using only READ, WRITE, ADD, SUBTRACT, MOVE,
INC/DEC, BLT and HALT instructions, which operates with O(poly(V∞*S∞)) hardware
and poly(T∞) parallel time. Then, there exists a formula of FO+Y(T,S,V) which expresses
the language decided by A.

Proof: 

First we show how the total ordering and BIT predicates can be expressed. The expressions
are given only for FO+Y(LIN,LOG,BIN) and FO+Y(LOG,LIN,BIN), where each of these
calculi is restricted to space arity 1. These are the tightest restrictions, and from these,
the expressions for all calculi FO+Y(T,S,V) (T,S∈{LOG,LIN}; V∞<S∞; T,S not both LOG),
with and without restriction of space arity, can be obtained easily.

The total ordering predicate < on domain LIN is defined in FO+Y(LIN,LOG,BIN) as
(note that the variance arity is zero)

<(x,y)=Y(P,t,,s)[F](LOG∞)

where

F=[(EQ(s,LOG1) ∧ EQ(t,x)) ∨
   (EQ(s,LOG∞) ∧ EQ(t,y) ∧ P(LOG1)) ∨
   EQ(x,y) ∨
   P(s)
  ]

and in FO+Y(LOG,LIN,BIN) as

<(x,y)=Y(P,t,r,s)[F](3*LIN∞)

where

F=[(EQ(t,LOG1) ∧ (r=1) ∧ EQ(s,x)) ∨
   (EQ(t,LOG1) ∧ (r=2) ∧ EQ(s,y)) ∨
   (¬EQ(t,LOG1) ∧ (r=1) ∧ EQ(x,DOUBLE(s))) ∨
   (¬EQ(t,LOG1) ∧ (r=1) ∧ EQ(x,SUC(DOUBLE(s)))) ∨
   (¬EQ(t,LOG1) ∧ (r=2) ∧ EQ(y,DOUBLE(s))) ∨
   (¬EQ(t,LOG1) ∧ (r=2) ∧ EQ(y,SUC(DOUBLE(s)))) ∨
   ((r=3) ∧ EQ(x,y)) ∨
   ((r=3) ∧ ∃z(P(1*z) ∧ P(2*SUC(z)))) ∨
   ((r=3) ∧ ∃z(P(1*z) ∧ P(2*SUC(SUC(z))))) ∨
   ((r=3) ∧ ∃z(P(1*z) ∧ P(2*SUC(SUC(SUC(z)))))) ∨
   ((r=3) ∧ P(3*s))
  ]

Total ordering in other domains is easily expressed in terms of total ordering on LIN and 
the predefined functions. In what follows, we use the symbol “<” to denote total ordering in 
all domains. 
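
The first of the two definitions above (time domain LIN, space domain LOG, variance arity zero) can be simulated directly. The sketch below is an illustration that assumes the integer model of the domains; the name `order_via_Y` is hypothetical. The time variable sweeps through LIN; cell LOG1 is set when the sweep passes x, and the largest LOG cell is set when the sweep later reaches y with LOG1 already set. With the EQ(x,y) disjunct as transcribed here, the predicate also holds when x equals y.

```python
def order_via_Y(x, y, n):
    """Evaluate the Y-formula for the ordering of x and y in LIN; n > 4 assumed."""
    log_last = (n - 1).bit_length()        # largest element of LOG
    P = frozenset()
    for t in range(1, n + 1):              # the time variable sweeps through LIN
        P = frozenset(
            s for s in range(1, log_last + 1)
            if (s == 1 and t == x)
            or (s == log_last and t == y and 1 in P)
            or x == y
            or s in P
        )
    return log_last in P                   # the query tuple is the largest LOG element


before = order_via_Y(2, 5, 6)
after = order_via_Y(5, 2, 6)
```

Only two cells of the LOG workspace are ever used, which is why a space domain as small as LOG suffices when the time domain is linear.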

The following predicates are FO-definable:

ODD'(x)=¬∃y[EQ(x,DOUBLE(y))]

EVEN'(x)=(∃y∀z)[EQ(x,DOUBLE(y)) ∧
                ¬(¬EQ(y,z) ∧ EQ(y,SUC(z)) ∧ EQ(x,DOUBLE(z)))
               ]

MAX(x)=EQ(x,SUC(x))

ODDMAX(x)=[MAX(x) ∧ (∃y)[¬EQ(x,y) ∧ EQ(x,SUC(y)) ∧ EVEN'(y)]]

EVENMAX(x)=[MAX(x) ∧ (∃y)[¬EQ(x,y) ∧ EQ(x,SUC(y)) ∧ ODD'(y)]]

ODD(x)=[ODD'(x) ∨ ODDMAX(x)]

EVEN(x)=[EVEN'(x) ∨ EVENMAX(x)]

HALF'(x,y)=[¬EQ(x,y) ∧ [EQ(x,DOUBLE(y)) ∨ EQ(x,SUC(DOUBLE(y)))]]

HALF(x,y)=[HALF'(x,y) ∧ ¬(∃z)[¬EQ(y,z) ∧ EQ(y,SUC(z)) ∧ HALF'(x,z)]]

The BIT predicate is implicitly FO-definable as follows:

BIT'(x,n)=[[(EQ(n,LIN1) ∧ ODD(x)) ∨
            (∃y,m)[HALF(x,y) ∧ ¬EQ(n,m) ∧ EQ(n,SUC(m)) ∧ BIT'(y,m)]] ∧
           (∀y,z)[EQ(y,z) ∨
                  (∃m)[(BIT'(y,m) ∧ ¬BIT'(z,m)) ∨
                       (BIT'(z,m) ∧ ¬BIT'(y,m))
                      ]
                 ]
          ]

BIT(x,n)=(∃y)[¬EQ(x,y) ∧ EQ(x,SUC(y)) ∧ BIT'(y,n)]

The definition of the BIT predicate in terms of the BIT' predicate is FO. It remains to 
give the explicit definition of the BIT' predicate in the extended calculi. 

In FO+Y(LOG,LIN,BIN), BIT' can be defined as follows:

BIT'(x,n)=Y(P,t,r,s)[F](3*LIN∞)

where

F=[(EQ(t,LIN1) ∧ [EQ(n,LIN1) ∧ ODD(x)] ∧ (r=3) ∧ EQ(s,LIN∞)) ∨
   (EQ(t,LIN1) ∧ ¬[EQ(n,LIN1) ∧ ODD(x)] ∧
    [((r=1) ∧ HALF(x,s)) ∨
     ((r=2) ∧ ¬EQ(n,s) ∧ EQ(n,SUC(s)))
    ]) ∨
   (P(3*LIN∞) ∧ (r=3) ∧ EQ(s,LIN∞)) ∨
   (¬EQ(t,LIN1) ∧
    [P(2*LIN1) ∧ (∃y)(P(1*y) ∧ ODD(y))] ∧
    (r=3) ∧ EQ(s,LIN∞)) ∨
   ((∃x',n')[P(1*x') ∧ P(2*n') ∧ ¬(EQ(n',LIN1) ∧ ODD(x')) ∧
             (((r=1) ∧ HALF(x',s)) ∨ ((r=2) ∧ ¬EQ(n',s) ∧ EQ(n',SUC(s))))
            ])
  ]

In FO+Y(LIN,LOG,BIN), BIT' can be directly defined as follows:

BIT'(x,n)=(n<TOLIN(LOG∞)) ∧ Y(P,t,r,s)[F](2*TOLOG(n))

where

F=[((r=1) ∧ [(¬P(1*s) ∧ (∀z<s)P(1*z)) ∨ (P(1*s) ∧ (∃z<s)¬P(1*z))]) ∨
((r=2) ∧ EQ(t,x) ∧ P(1*s)) ∨ ((r=2) ∧ P(2*s))]

With < and BIT extending FO, addition and subtraction are known to be expressible
[StockmeyerVishkin1984] [Immerman1987b]. We now show how to simulate the computation
of a PRAM M. On input x, |x|=n, M runs in t(n) synchronous steps, using p(n) processors
and m(n) global memory cells. We consider only the cases (t(n)=polylog(n) and
p(n),m(n)=O(n^k), k≥1) and (t(n)=poly(n) and p(n),m(n)=O((log(n))^k), k≥1). Since the
numbers of processors and global memory cells are bounded by some polynomial in n (log(n)
in the second case), we need only a constant number of variables over the LIN domain (LOG
domain in the second case) and the BIN domain to name any processor or global memory
cell. Since the number of time steps and the wordwidth of registers and global memory cells
are bounded by some polynomial in log(n) (n in the second case), we need only a constant
number of variables over the LOG domain (LIN domain in the second case) to name any
time instant or bit within a given word. Further, the numbers of distinct names so created are
O(p(n)+m(n)) and poly(t(n)+width(n)) respectively. We can thus define the contents of all
the relevant registers and global memory cells at any time instant, while keeping the number
of variables over LIN (LOG in the second case) in check.

Say VALUE(loc,tim,pos,bool) has the meaning that the bit in position pos at location
loc just after step tim is bool. It is straightforward to write an FO+Y(LOG,LIN,BIN)
(FO+Y(LIN,LOG,BIN) in the second case) expression for VALUE if the PRAM instructions
are expressed. By explicitly using the iterative and non-monotonic update abilities of
the Y-operator for READ, WRITE and MOVE, searching for the last time step at which the
cell was written to, and the lowest numbered processor which attempted a WRITE at that
time, becomes unnecessary.

It is clear that the initial state can be expressed, using BIT to express the fact that the
initial contents of each processor’s PROCESSOR register is its processor number
[Immerman1987b]. Addition and subtraction have already been mentioned. It remains to show that
BLT (Branch-if-less-than-zero) and HALT are expressible. This follows since the contents
of the register PROGRAM_COUNTER are available as data objects, and the program is of
finite length.

∎



Chapter 4 


Relational transformations 


In chapter 2, the sequential and parallel models of computation and the resources of interest
were defined. Then the resource bounded simulation of the DTM model on the PRAM
model was presented. In chapter 3, the calculus FO+Y(T,S,V) extending first order logic
was defined and a standard form was demonstrated. Then the simulation of the PRAM
model in the calculus was presented. The evaluation of formulas in standard form on the
DTM model is presented in this chapter.

The standard form gives a regular structure to the “global memory access pattern” of
PRAM algorithms, and the normalised algorithm represented by the expression in standard
form can be implemented “efficiently” in terms of mode transitions on the DTM (a polynomial
blowup of the workspace occurs while expressing the algorithm in the calculus). The following
properties of a formula Y(T,S,V)(P,t,r,s)[F](w) in standard form are used:

(1) There are no first-order quantifiers and there is at most one Y-operator, which is at 
the outermost level. This means that the predicate symbol bound by the Y-operator can be 
viewed as the workspace of the DTM computation, and the time arity can be viewed as the 
iteration count of a simple loop. 

(2) The subformula F is propositional. This means that once the truth values of the
atomic predicates are known, the subformula F can be evaluated for each given binding in
O(1) time by the delta function of the DTM.

(3) In any (q+k)-tuple x which is an argument of an occurrence of P in F, the first q
terms of the tuple (which in general may be constructed from any constant or variable and
the predefined functions, but which take values from the variance domain) do not contain
any occurrence of the variables of s. This means that the new binding of the predicate P can
be constructed in V∞^q sequential iterations corresponding to the first q terms of the tuple,
with each iteration involving the computation of a map from S^k to S^k.

(4) There is no occurrence of the functions SUC and DOUBLE over the LOG and BIN
domains in the formula. Since all occurrences of the functions SUC and DOUBLE take
arguments only from the domain LIN, deep nesting of the typecasting functions TOBIN,
FROMBIN, TOLOG and FROMLOG can be collapsed. The map from S^k to S^k mentioned
above can be expressed entirely in terms of variables and functions over the LIN domain.

Note that k-ary predicates over the LIN domain can be viewed as binary strings of length
n^k stored on a DTM tape. A k-tuple of variables over the LIN domain can be viewed as a
position of the tape head. A predefined function like SUC can be viewed as a map from head
positions to head positions. The key idea of this chapter is that k-ary predicates can also be
viewed as the contents of a one-bit register in an n^k-processor fixed connection network. A
k-tuple of variables over the LIN domain can be viewed as a processor index. A predefined
function like SUC can be viewed as a data transformation to be achieved by routing messages 
over the network using the properties of the interconnection functions. The way this key idea 
is applied here is by interpreting familiar interconnection functions as functions over binary 
strings and showing that these are efficiently implementable on the DTM. 


4.1 Basic Redistributive Functions 

One requirement of parallel architectures that does not apply to serial architectures is the
necessity for rearranging data in order to avoid memory access contention while providing
fast parallel access. A good parallel architecture design is highly dependent on how efficiently
the data manipulating functions required by the intended applications are implemented. The
circuits used to achieve these functions can be considered to form an independent functional 
block, called a data manipulator (or interconnection) network. 

To rearrange data in shared memory, or, equivalently, to redistribute data over a network, 
a programmed sequence of basic functions must be executed. The choice of basic functions 
to be hardwired determines the set of data manipulations that are achievable as well as their 
efficiency. Several authors have considered such basic function sets, commonly including 
functions from the following four categories: 

1 Bit-Permute-Complement (BPC) permutation [Fraser1976] [NassimiSahni1981b]
[YewLawrie1981] [NassimiSahni1982]

2 p-ordering and cyclic shift within segments [Lawrie1975] [Lenfant1978] [YewLawrie1981]
[NassimiSahni1981b]

3 broadcast [Feng1974] [Thompson1978] [Siegel1979] [Parker1980] [NassimiSahni1981a]

4 masking [Feng1974] [Siegel1977] [Siegel1979]

In simulating FO+Y(LOG,LIN,V) formulas on the multitape DTM model of computation,
it will prove necessary to rearrange the tape contents to reduce the number of mode
transitions. Pippenger [Pippenger1979] proposed to use sorting network techniques [Batcher1968]
[Stone1971] for reducing the number of reversals. We, of course, have the additional
requirement of ensuring that the workspace used is optimal within constant factors.

We define a set of data manipulation functions M and a set of housekeeping functions
H and show that they can be computed on multitape DTMs using O(n) workspace and
polylog(n) mode transitions on inputs of length n over the alphabet Σ={0,1}. (We give actual
constructions in pseudocode for each function.) The proof of the simulation result makes
extensive use of these functions.

Some of the function names have subscripts which are integers. These subscripts are 
additional inputs to the corresponding functions. We use the following encoding of integers 
as strings: The empty string represents the integer 0. Strings from the set {1}+ represent
positive integers. Strings from the set {0}+ represent negative integers. Other strings are
not interpreted as numbers. Integers encoded as above are also interpreted as strings.

Let a string x∈Σ*, |x|=2^m, be laid out on a DTM tape. The leftmost symbol of x is said
to have address 0, and, incrementing to the right, the rightmost symbol has address (2^m-1).
A data manipulation function f∈M is specified by declaring that the symbol at address i
in the string f(x) is the same as the symbol at address f'(i) in string x, for some function
f' on integers whose domain is [0..(2^m-1)] and range is a subset of [0..(2^m-1)]. We refer to
the function f' as an address transformation and to the function f as the corresponding data
transformation. (The prime ' distinguishes an address transformation from the corresponding
data transformation.) We refer to the binary expansion of address i as
[i_{m-1} i_{m-2} ... i_1 i_0], where
i = i_{m-1}*2^{m-1} + i_{m-2}*2^{m-2} + ... + i_1*2^1 + i_0*2^0.

The set M contains the following functions: 

1 Outer Shuffle so_m:

The address transformation is
so'_m([i_{m-1} i_{m-2} ... i_1 i_0]) = [i_{m-2} ... i_1 i_0 i_{m-1}]
or equivalently,
so'_m(i) = 2i if 0≤i≤(2^{m-1}-1)

2i+1-2^m if 2^{m-1}≤i≤(2^m-1)

2 Inner Shuffle si_m:

The address transformation is

si'_m([i_{m-1} i_{m-2} ... i_1 i_0]) = [i_{m-1} i_{m-2} ... i_2 i_0 i_1]
or equivalently,
si'_m(i) = i if (i mod 4)∈{0,3}

4*⌊i/4⌋+3-(i mod 4) if (i mod 4)∈{1,2}

3 Skew sk_{c,m}, for each 2≤c≤m:

The address transformation is

sk'_{c,m}([i_{m-1} i_{m-2} ... i_{c+1} i_c i_{c-1} ... i_1 i_0]) = [i_{m-1} i_{m-2} ... i_{c+1} i_c i_0 ... i_0 i_0]
or equivalently,

sk'_{c,m}(i) = 2^c*⌊i/2^c⌋+(2^c-1)*(i mod 2)

The last c bits of the address are replaced by the last bit extended c times.

4 Select selc_m, for each c∈Σ, i.e. sel0_m and sel1_m:

The address transformation is
selc'_m([i_{m-1} i_{m-2} ... i_1 i_0]) = [i_{m-1} i_{m-2} ... i_1 c]

or equivalently,

selc'_m(i) = 2*⌊i/2⌋+c

The last bit of the address is replaced by c.

5 Successor suc_{c,m}, for each 2≤c≤m:

The address transformation is

suc'_{c,m}(i) = 2^c*⌊i/2^c⌋+min((2^c-1),((i mod 2^c)+1))
or equivalently,

suc'_{c,m}(i) = i+1 if (i mod 2^c)<(2^c-1)
i if (i mod 2^c)=(2^c-1)

Increment if there is no carry out of the c-th bit (i.e. saturating increment of the last c
bits).

6 Double db_{c,m}, for each 2≤c≤m:

The address transformation is
db'_{c,m}(i) = 2^c*⌊i/2^c⌋+min((2^c-1),(2(i mod 2^c)+1))
or equivalently,

db'_{c,m}(i) = 2^c*⌊i/2^c⌋+(2(i mod 2^c)+1) if (2(i mod 2^c)+1)<2^c

2^c*⌊i/2^c⌋+(2^c-1) if (2(i mod 2^c)+1)≥2^c

Saturating double of the last c bits. This definition may appear nonintuitive. This is
because the interpretation of domain elements as numbers that was chosen for the logic
mapped the LIN domain to the range [1..n] while the addresses are in the range [0..(n-1)].
We have defined the function this way to simplify the simulation of the logic.

7 Log Truncate trlog_{c,m}, for each 2≤c≤m:

The address transformation is

trlog'_{c,m}(i) = 2^c*⌊i/2^c⌋+(c-1) if (i mod 2^c)>(c-1)
i if (i mod 2^c)≤(c-1)

8 Const Truncate tr2_{c,m}, for each 2≤c≤m:

The address transformation is

tr2'_{c,m}(i) = 2^c*⌊i/2^c⌋+1 if (i mod 2^c)>1

i if (i mod 2^c)≤1

The functions in M are defined in terms of address transformations, so the length of the 
output string is the same as the length of the input string. Also these functions are computed 
by circuits with no gates: each output bit is the image of a particular bit of the input. Thus 
they may be described as non-inverting, length preserving projections. 
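
The address transformations of M can be tried out with a short Python sketch. It follows the arithmetic forms given above as we read them (the inner shuffle swaps the two least significant address bits; the two truncations are written with min). The helper names such as `so_addr` and `apply_transformation` are hypothetical, and strings stand in for tape contents.

```python
def so_addr(i, m):
    """Outer shuffle: rotate the m-bit address left by one position."""
    return 2 * i if i < 2 ** (m - 1) else 2 * i + 1 - 2 ** m


def si_addr(i):
    """Inner shuffle: swap the two least significant address bits."""
    return i if i % 4 in (0, 3) else 4 * (i // 4) + 3 - (i % 4)


def sk_addr(i, c):
    """Skew: replace the last c bits by the last bit extended c times."""
    return 2 ** c * (i // 2 ** c) + (2 ** c - 1) * (i % 2)


def sel_addr(i, bit):
    """Select: replace the last address bit by the given bit."""
    return 2 * (i // 2) + bit


def suc_addr(i, c):
    """Successor: saturating increment of the last c bits."""
    return 2 ** c * (i // 2 ** c) + min(2 ** c - 1, i % 2 ** c + 1)


def db_addr(i, c):
    """Double: saturating 2x+1 on the last c bits (addresses are 0-based)."""
    return 2 ** c * (i // 2 ** c) + min(2 ** c - 1, 2 * (i % 2 ** c) + 1)


def trlog_addr(i, c):
    """Log truncate: clamp the last c bits to at most c-1."""
    return 2 ** c * (i // 2 ** c) + min(c - 1, i % 2 ** c)


def tr2_addr(i, c):
    """Const truncate: clamp the last c bits to at most 1."""
    return 2 ** c * (i // 2 ** c) + min(1, i % 2 ** c)


def apply_transformation(x, addr):
    """Data transformation: symbol i of f(x) is symbol f'(i) of x."""
    return "".join(x[addr(i)] for i in range(len(x)))


x = "abcdefgh"  # 2^m symbols with m = 3
shuffled = apply_transformation(x, lambda i: so_addr(i, 3))
```

Every output symbol is a copy of one input symbol, which is the sense in which these functions are length-preserving projections.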

Apart from these, we shall also need some functions that take more than one argument
and some functions whose output strings differ in length from the input. We describe
below the set H of such functions. The functions in H perform string manipulations and
arithmetic computations.

The set H contains the following functions: 

1 zero: This is a function with no argument which returns the integer 0.

2 dec: On input integer n, returns the integer n-1.

3 inc: On input integer n, returns the integer n+1.

4 half: On input integer n, returns the integer ⌊n/2⌋.

5 doub: On input integer n, returns the integer 2*n.

6 log: On input integer n, returns the integer log n if n>0 and returns -1 otherwise.

7 exp: On input integers m, n, returns the integer min(m,2^n) if m>0 and n>0 and returns
0 otherwise.

8 sub: On input integers m, n, returns the integer m-n.

9 add: On input integers m, n, returns the integer m+n.

10 abs: On input integer n, returns the absolute value of n.

11 sign: On input integer n, returns the integer n/abs(n) if n≠0 and returns 0 otherwise.

12 if: On input integers l, m, n, returns the integer m if l≠0 and returns n otherwise.

13 nil: This is a function with no argument which returns the empty string.

14 sym: On input integer n, returns the string of length 1 consisting of the n-th symbol of
the alphabet if n>0 and n≤|Σ| and returns the empty string otherwise.

15 len: On input string x, returns the integer length-of-x.

16 head: On input string x and integer n, returns the string consisting of the first
min(len(x),max(n,0)) symbols of x.

17 conc: On input strings x, y, returns the string obtained by concatenating x and y.

18 mask: On input x∈Σ+, y∈Σ*, mask outputs the (unique) string of length |y| in the set
{z|z is a prefix of an element of {x}*}; i.e. it outputs the symbols of x in left-to-right order
over and over again till the number of symbols output is |y|.

19 comb: On input x,y,z∈Σ*, |x|=|y|=|z|, comb outputs a string of length |x| using the
bits of x to select between the bits of y and z. The k-th bit output by comb is the k-th bit of y
if the k-th bit of x is 0, and the k-th bit of z otherwise.

20 find: On input strings x, y, returns string z of length equal to that of y such that
the n-th symbol of z is 1 if there is an occurrence of the first min(len(x),⌈log2(len(y)+1)⌉)
symbols of x as a substring of y starting from the n-th symbol of y and is 0 otherwise.

These string manipulation functions are illustrative and other (easy) string operations 
assumed to be available may be identified from the implementations presented below. 
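
A few of the housekeeping functions can be sketched in Python to make their input and output conventions concrete. This is only an illustration over Python strings, not a DTM implementation; the pattern-length bound in `find` uses our reading ⌈log2(len(y)+1)⌉, and the names `encode` and `decode` for the integer encoding are hypothetical.

```python
def encode(n):
    """Integer to string: empty for 0, n ones for n > 0, |n| zeros for n < 0."""
    return "1" * n if n >= 0 else "0" * (-n)


def decode(s):
    """String to integer, or None when the string is not interpreted as a number."""
    if s == "":
        return 0
    if set(s) == {"1"}:
        return len(s)
    if set(s) == {"0"}:
        return -len(s)
    return None


def mask(x, y):
    """Repeat the symbols of x until |y| symbols have been output."""
    return "".join(x[i % len(x)] for i in range(len(y)))


def comb(x, y, z):
    """Use the bits of x to select between the bits of y (on 0) and z (on 1)."""
    return "".join(zk if xk == "1" else yk for xk, yk, zk in zip(x, y, z))


def find(x, y):
    """Mark every position of y where the truncated pattern from x starts."""
    # len(y).bit_length() equals ceil(log2(len(y) + 1)) for non-negative lengths.
    pattern = x[:min(len(x), len(y).bit_length())]
    return "".join("1" if y.startswith(pattern, n) else "0" for n in range(len(y)))


hits = find("01", "00101")
```

Here `hits` marks positions 1 and 3 of "00101", where the pattern "01" begins.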

4.2 Two Lemmas about Mode Transitions 

In presenting the implementation of the basic redistributive functions on the DTM, it will be 
convenient to use a resource-bounded composition schema which, given efficient implementa- 
tions of two functions, returns an efficient implementation of their composition. While such 
schemas are well known for complexity measures like runtime, workspace and reversals, it is 
not obvious that such a schema exists for mode transitions. Here we prove two lemmas that 
show the flexibility of this resource for efficient programming. 

Lemma 4.2.1: 

Let M be any DTM with cycle-time w and let k be any integer greater than 0. Let M' be the
DTM obtained from M by padding the tape alphabet of M with enough new symbols so that the
cycle-time of M' becomes k*w, with no change in the delta function. Then the computation of
M' is the same as that of M on each input, the runtime, workspace and reversal complexities
are the same for M and M', and if the mode transition complexity of M on inputs of length n
is x(n), then the mode transition complexity of M' is O(x(n)).

Proof: 

The only case needing detailed treatment is that of mode transitions. To see that the 
claim holds, note that the expected head movement remains the same if the mode tuple is 
lengthened by concatenating several copies of the original mode. Since the computations are 
identical on each input, one may refer to corresponding steps of the two machines. Let a step 
at which a mode transition by M occurs be referred to as a cutpoint. Let the steps between 
cutpoints be referred to as belonging to the immediately previous cutpoint. (Cutpoint steps 
belong to themselves.) Let the mode transitions of M' occurring at steps belonging to a
cutpoint be referred to as belonging to the mode transition of M at that cutpoint. The 
number of mode transitions of M' belonging to any mode transition of M is clearly at most 
k*w. The mode transitions of M' accounted for in this way are the only mode transitions 
made by M'. Thus if the number of mode transitions used by M on inputs of length n is at 
most x(n), the number of mode transitions used by M' is at most k*w*x(n). 

Note that the mode transition complexity of M' can easily be less than that of M. Suppose 
that M is actually following a cyclic pattern of length k and k is mutually prime to w. Then 
the mode transition complexity of M could be high while that of M' is low. 
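
The remark can be checked on a toy example. The Python sketch below is a simplification under stated assumptions: it tracks a single tape, treats the first `period` movements as establishing the mode, and counts a movement as unexpected when it differs from the movement made `period` steps earlier. The function name `unexpected_moves` is hypothetical.

```python
def unexpected_moves(moves, period):
    """Count head movements that do not repeat the movement made `period` steps earlier."""
    return sum(
        1
        for t in range(period, len(moves))
        if moves[t] != moves[t - period]
    )


# The head follows a cyclic pattern of length k = 3: right, right, left.
moves = [+1, +1, -1] * 20
w, k = 2, 3

with_w = unexpected_moves(moves, w)        # cycle-time w = 2, coprime to 3
with_kw = unexpected_moves(moves, k * w)   # lengthened cycle-time k*w = 6
```

With cycle-time 2 most movements are unexpected, while with cycle-time 6 none are, and in both directions the count for k*w stays within k*w times the count for w, as the lemma requires.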

∎

Lemma 4.2.2: 

Let M_1 and M_2 compute functions f() and g() and let their mode transition complexities
on inputs of length n be x_1(n) and x_2(n) respectively. Then there exists a DTM M_3
computing function f(g()) with the number of mode transitions used on input string y being
O(x_1(|g(y)|)+x_2(|y|)).


Proof:

Construct M_3 in the ordinary way and use the construction of the previous lemma to
adjust its cycle-time to some common multiple of the cycle-times of M_1 and M_2.

∎


4.3 Implementation of Basic Redistributive Functions on the DTM

The implementation of the basic redistributive functions is presented in pseudocode. The 
pseudocode convention is as follows: 

The language is pseudo-C. Each tape is represented as a one-dimensional array. Each 
location of the array corresponds to a location of the tape. The numbering of the locations is 
as per the convention that the head positions at the start of the computation are numbered 0 
and incrementing is to the right. (Thus a negative array index represents a head position to 
the left of the starting position.) Upper case variables represent either constants or variables 
bound outside the scope of the algorithm being presented. Lower case variables represent 
machine state and iterator variables bound within the algorithm being presented. Such 
iterator variables usually refer to head positions of tapes. Note that there are implicit tape 
rewinds when these iterator variables are initialised. 

By definition, sweeps take O(1) mode transitions. To represent these in pseudocode, we
introduce the sweep pseudo-statement, whose syntax and semantics are the same as those of
the for statement, except for the use of the keyword sweep in place of the keyword for. The use
of the sweep pseudo-statement constitutes an assertion that the code embedded in its scope
does not cause a mode transition and that the entire for-loop can be implemented with O(1)
mode transitions. This distinguishes sweeps from those for-loops in the pseudocode where
the embedded code requires one or more mode transitions. In the embedded code of a sweep
statement, head positions on different tapes are specified as linear functions of the iterator
variable. This indicates how synchronised cycles of head movements on the tapes involved
can be set up, reducing the number of mode transitions.

The implementation of the basic redistributive functions is now presented as programs
in pseudocode. Each input and the output, including each additional input, is on a separate
tape. The efficient composition schema of the previous section is used implicitly. The use of 
arithmetic expressions in the pseudocode is to be understood as representing occurrences of 
the housekeeping functions for arithmetic computation. The portion of tape accessed by a 
sweep statement may be thought of as a string in which case many of the sweep statements 
in the pseudocode could be replaced by the string manipulation functions. This is not done 
here for uniformity. 

1 Outer shuffle so_m: Using a temporary tape Array_new, an inplace outer shuffle of the
tape Array_old is obtained.

begin

sweep(q=0;q<(2^m/2);q++)Array_new[q]←Array_old[2q];
sweep(q=0;q<(2^m/2);q++)Array_new[(2^m/2)+q]←Array_old[2q+1];
sweep(q=0;q<2^m;q++)Array_old[q]←Array_new[q];
end

2 Inner shuffle si_m: Using a temporary tape Array_new, an inplace inner shuffle of the tape
Array_old is obtained.

begin

sweep(q=0;q<(2^m/4);q++)Array_new[4q]←Array_old[4q];
sweep(q=0;q<(2^m/4);q++)Array_new[4q+1]←Array_old[4q+2];
sweep(q=0;q<(2^m/4);q++)Array_new[4q+2]←Array_old[4q+1];
sweep(q=0;q<(2^m/4);q++)Array_new[4q+3]←Array_old[4q+3];
sweep(q=0;q<2^m;q++)Array_old[q]←Array_new[q];
end

3 Skew sk_{c,m}, for each 2≤c≤m: The last c bits of the address partition the data into 2^c
interleaved blocks each of which is a regularly spaced sequence of length 2^{m-c} and spacing 2^c.
In a skew sk_{c,m}, the last c bits of the address are replaced by the last bit extended c times.
The two values of this last bit correspond to two of these 2^c blocks, one of which consists of
the first bit of data and every (2^c)-th bit thereafter, while the other consists of the (2^c)-th bit
of data and every (2^c)-th bit thereafter. The skew is achieved by interleaving these 2 blocks to
form a sequence of 2^{m-c} bit-pairs and duplicating each bit-pair 2^{c-1} times. Using temporary
tapes A, B and Array_new, an inplace skew sk_{c,m} of the tape Array_old is obtained:
begin

sweep(q=0;q<2^c;q++)B[q]←0;B[2^c-1]←1;

for(i=0;i<(m-c);i++){

sweep(q=0;q<2^{c+i};q++)A[q]←B[q];

sweep(q=0;q<2^{c+i};q++)A[q+2^{c+i}]←B[q];

sweep(q=0;q<2^{c+i+1};q++)B[q]←A[q]; /* copy the doubled marker pattern back to B */

}

mode=WAITING;

sweep(q=0;q<2^m;q++)switch(mode){

case WAITING:data=Array_old[q];Array_new[q]=data;mode=SKIPPING;break;
case SKIPPING:if(B[q]==1)mode=WAITING; else mode=COPYING;break;
case COPYING:Array_new[q]=data;mode=SKIPPING;break;

}

sweep(q=0;q<2^c;q++)B[q]←0;B[0]←1;

for(i=0;i<(m-c);i++){

sweep(q=0;q<2^{c+i};q++)A[q]←B[q];

sweep(q=0;q<2^{c+i};q++)A[q+2^{c+i}]←B[q];

sweep(q=0;q<2^{c+i+1};q++)B[q]←A[q]; /* copy the doubled marker pattern back to B */

}

mode=WAITING;

sweep(q=(2^m-1);q>=0;q--)switch(mode){

case WAITING:data=Array_old[q];Array_new[q]=data;mode=SKIPPING;break;
case SKIPPING:if(B[q]==1)mode=WAITING; else mode=COPYING;break;
case COPYING:Array_new[q]=data;mode=SKIPPING;break;

}

sweep(q=0;q<2^m;q++)Array_old[q]←Array_new[q];
end

Note that the entire switch statement is first-order involving no head movement and 
therefore can be executed in one step. 

4 Select selc_m, for each c∈Σ, i.e. sel0_m and sel1_m: The last bit of the address is replaced
by c. Using temporary tape Array_new, an inplace select selc_m of the tape Array_old is obtained:

begin

mode=WAITING;

sweep(p=0;p<2^m;p++){

if(c==0)q=p;else q=(2^m-1-p); /* left-to-right for sel0, right-to-left for sel1 */

switch(mode){

case WAITING:data=Array_old[q];Array_new[q]=data;mode=COPYING;break;

case COPYING:Array_new[q]=data;mode=WAITING;break;

}

}

sweep(q=0;q<2^m;q++)Array_old[q]←Array_new[q];

end

5 Successor suc_{c,m}, for each 2≤c≤m: Increment if there is no carry out of the c-th bit
(i.e. saturating increment of the last c bits). Using a temporary tape Array_new, an inplace
successor of the tape Array_old is obtained.

begin

for(i=0;i<(m-c);i++)so_m;

/* This separates the interleaved blocks */
sweep(q=0;q<2^m;q++)Array_new[q]=Array_old[q+2^c];

/* This does the job for all but the last value of the c-tuple */

/* Since SUC(2^c)=2^c itself, junk data got written in the last block */

/* Recover from this by restoring the old data in the last block */
sweep(q=0;q<2^c;q++)Array_new[2^m-1-q]=Array_old[2^m-1-q];

sweep(q=0;q<2^m;q++)Array_old[q]=Array_new[q];

/* Now the separated blocks have to be interleaved again */
for(i=0;i<c;i++)so_m;

end

6 Double db_{c,m}, for each 2≤c≤m: Saturating double of the last c bits. Using temporary
tapes Array_new, Array_1 and Array_2, an inplace double of the tape Array_old is obtained:
begin 

(sOm)'"“‘^'''Hso„sim ; 

/* This brings data from even addresses to the front in each block */

/* The last elements in each block of size 2^c are wanted */

(so_m)^{m-c}; /* This separates the interleaved blocks */

/* The second half of Array_old becomes the first half of Array_new */
sweep(q=0;q<2^{m-1};q++)Array_new[q]=Array_old[q+2^{m-1}];

/* Since DOUBLE(x)=2^c for all x in the range 2^{c-1}≤x≤2^c, */

/* the last data element of the original block has to be copied here; */

/* this is done simultaneously for all blocks as follows: */

/* The last valid data block starts at offset (2^c-1)*2^{m-c} */

/* Copy this for later use and make 2^{c-1} copies: */
j=(2^c-1);

for(i=0;i<(m-c);i++)j=2*j;

sweep(q=0;q<2^{m-c};q++)Array_1[q]=Array_old[q+j];

l=2^{m-c};

for(j=0;j<(c-1);j++){

sweep(q=0;q<l;q++)Array_2[l+q]=Array_1[q];l=2*l;

sweep(q=0;q<l;q++)Array_1[q]=Array_2[q];

}

/* The contents of Array_1 become the second half of Array_new */

sweep(q=0;q<2^{m-1};q++)Array_new[q+2^{m-1}]=Array_1[q];

sweep(q=0;q<2^m;q++)Array_old[q]=Array_new[q];

(so_m)^c; /* The separated blocks are interleaved again */
/* The output is obtained in Array_old itself */
end

7 Log Truncate trlog_{c,m}, for each 2≤c≤m. Using temporary tapes Array_1 and Array_2, an
inplace log truncate of the tape Array_old is obtained:

begin

(so_m)^{m-c}; /* This separates the interleaved blocks */

/* The first (c-1) blocks each of size 2^{m-c} are now correct */

/* The last valid data block now starts at offset (c-1)*2^{m-c} */

/* Copy this for later use and make 2^c copies: */
j=(c-1);

for(i=0;i<(m-c);i++)j=2*j;

sweep(q=0;q<2^{m-c};q++)Array_1[q]=Array_old[q+j];

l=2^{m-c};

for(j=0;j<c;j++){

sweep(q=0;q<l;q++)Array_2[l+q]=Array_1[q];l=2*l;
sweep(q=0;q<l;q++)Array_1[q]=Array_2[q];

}

j=c;

for(i=0;i<(m-c);i++)j=2*j;

sweep(q=0;q<2^m;q++)Array_old[q+j]=Array_1[q];

(so_m)^c; /* The separated blocks are interleaved again */

/* The output is obtained in Array_old itself */
end

8 Const Truncate tr2_{c,m}, for each 2≤c≤m: The implementation of tr2_{c,m} is similar to that
of trlog_{c,m}.

The implementation of the housekeeping functions is straightforward. Details are omitted. 
The claims of efficiency follow readily from the implementations. 
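
As a sanity check on two of the listings, the Python sketch below replays the sweeps of the outer shuffle and the two state-machine sweeps of the skew on ordinary lists. It is not a DTM simulation and counts no mode transitions. It assumes that the marker tape B holds a 1 in the last cell of every block of 2^c cells for the forward sweep and in the first cell for the backward sweep, which is modelled here by index arithmetic instead of a tape.

```python
def outer_shuffle(old):
    """Replay the first two sweeps of the outer shuffle listing."""
    n = len(old)
    new = [None] * n
    for q in range(n // 2):            # even addresses go to the first half
        new[q] = old[2 * q]
    for q in range(n // 2):            # odd addresses go to the second half
        new[n // 2 + q] = old[2 * q + 1]
    return new


def skew(old, c):
    """Replay the forward and backward state-machine sweeps of the skew listing."""
    n, block = len(old), 2 ** c
    new = [None] * n
    for order, marker in ((range(n), block - 1), (range(n - 1, -1, -1), 0)):
        mode, data = "WAITING", None
        for q in order:
            if mode == "WAITING":      # pick up the bit that is to be duplicated
                data = old[q]
                new[q] = data
                mode = "SKIPPING"
            elif mode == "SKIPPING":   # leave this cell to the other sweep
                mode = "WAITING" if q % block == marker else "COPYING"
            else:                      # COPYING: write the remembered bit
                new[q] = data
                mode = "SKIPPING"
    return new


skewed = skew(list("abcdefgh"), 2)
```

On the input "abcdefgh" with c=2 the result is a, d, a, d, e, h, e, h, which agrees with the address transformation 2^c*⌊i/2^c⌋+(2^c-1)*(i mod 2).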

4.4 Relational Transformations

Consider the calculus FO+Y(T,S,V) with T=LOG, S=LIN and V∈{LOG,BIN}. Let the
binding of a k-ary predicate P over S be represented on a DTM worktape as follows: The
truth-values of the predicate for all tuples s=<s_1 s_2 ... s_k> are laid out as YES/NO bits
(YES=1;NO=0) on this tape from left to right in lexicographic order of s, with s_1 being
the most significant variable over S and s_k being the least significant variable over S for this
purpose. For example, if P(S1*...*SUC(S1)) is true, then the second non-blank symbol from
the left is 1. When S∞ is not a power of 2, gaps are left in the data structure for technical
reasons. This is done as follows:

Let Z be a totally ordered domain of size 2^LOG∞ with first element Z1 and last element
Z∞. In general, Z∞≥LIN∞=S∞. Let LIN2Z(.) be the function embedding LIN as the
prefix of Z. The truth-values P(s) are laid out at locations indexed by LIN2Z(s), using a total
of Z∞^k rather than S∞^k tape cells. Note, however, that Z∞^k=O(S∞^k).

With this representation, the position of the head on the tape constitutes an encoding of 
a particular tuple s (ignoring the gaps). The truth- value of the predicate on a tuple s may be 
obtained by positioning the tape head at the corresponding location and reading the contents
of the tape cell under the head. 
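
The layout can be illustrated with a small Python sketch. It assumes that Z∞ is a power of 2 no smaller than S∞, that LIN2Z maps the i-th element of LIN to the i-th element of Z, and that tape cells are numbered from 0. The function name `tape_position` is hypothetical.

```python
def tape_position(s, z_max):
    """0-based cell index of the tuple s = (s_1, ..., s_k), each s_j in [1..z_max]."""
    position = 0
    for s_j in s:                  # s_1 is the most significant digit
        position = position * z_max + (s_j - 1)
    return position


# Example with S_max = 5 embedded in Z_max = 8 and arity k = 2.
second_cell = tape_position((1, 2), 8)   # the tuple (S1, SUC(S1))
next_row = tape_position((2, 1), 8)
```

In this example the cells 5, 6 and 7 of every row of 8 are the gaps, and the tape uses 8^2 = 64 cells where 5^2 = 25 would carry data.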

Consider a k-tuple x of constants, variables and derived terms of domain S. In general, x 
contains occurrences of constants and variables over the BIN, LOG and LIN domains, with 
the proviso that the only variables over the LIN domain that occur in x are those of s. The 
tuple x gets bound to specific ground tuples when the variables of s are given values (in the
context of given bindings for the other variables in x, referred to as a suitable context). The
lexicographically ordered sequence of all the tuples s induces an ordered sequence of ground
tuples x in a suitable context (possibly with multiple occurrences of some ground tuples in
the induced sequence). The truth values of the predicate for this ordered sequence of ground
tuples x can be laid out as YES/NO bits on a DTM worktape (with gaps if necessary) in the
manner described above.

Such a tuple x is said to be a relational transformation.

The Z∞^k-bit string representing the binding of a k-ary predicate P is the input. As s
varies lexicographically from S1^k to S∞^k, x addresses various locations in the input. The
data at these locations (in the order of access) has to be written on the output tape to
achieve the transformation. This constitutes a transformation of S∞^k-bit input sequences
into S∞^k-bit output sequences (ignoring the (Z∞^k-S∞^k) bits in the gaps). Each such
k-tuple x specifies a relational transformation and the set of all such tuples specifies the class
of relational transformations. One may say that the input sequence is in s-order and the task
is to generate an output sequence in x-order.

Since x is a k-tuple of elements drawn from the Herbrand Universe of constants, the 
(bound) variables of the BIN and LOG domains (given as a suitable context), the (unbound) 
variables of s and predefined functions (which are unary), one can obtain a general transfor- 
mation (of this class) by function composition from two types of simpler transformations, as 
shown below. 

Let s' be a k-tuple obtained by permuting the elements of s. The transformation specified
by s' is a permutation on S∞^k-bit sequences, since the pair <s,s'> defines a permutation
over numbers in the range [1..S∞^k].

Let s' be a k-tuple obtained by replacing the last c (for some 1≤c≤k) elements of s by
a c-tuple of terms drawn from the Herbrand Universe of constants, the (bound) variables
of t and r, the (unbound) variable s_k and the predefined functions (which are unary). (s_k
is the k-th element of s.) Then the pair <s,s'> defines a function b(.) over numbers in the
range [1..S∞^k] having the property that m<n iff b(m)<b(n). The transformation of S∞^k-bit
sequences specified by s' is said to be a c-broadcast.

Permutations are lossless while broadcasts are order-preserving. Any particular x can 
be decomposed into permutations and broadcasts. Broadcasts require the identification of a 
single sequence of length and duplication of each bit times. 

Permutations are shown to be further decomposable into relational outer shuffles and
relational inner shuffles. The relational outer shuffles and relational inner shuffles referred
to here are variants of the outer and inner shuffles defined among the basic redistributive
functions for data manipulation. There, addresses are binary, and a string of 2^m bits uses
m-bit addresses. Now consider a variant in which the address is k digits and each digit is
S∞-ary, giving an array-size of S∞^k.

Define the relational outer shuffle rso_k as:

rso_k: Array_new[s_1 s_2 ... s_{k-2} s_{k-1} s_k] <- Array_old[s_2 ... s_{k-2} s_{k-1} s_k s_1]

and the relational inner shuffle rsi_k as:

rsi_k: Array_new[s_1 s_2 ... s_{k-2} s_{k-1} s_k] <- Array_old[s_1 s_2 ... s_{k-2} s_k s_{k-1}]

to be used as primitive transformations. It is easy to see that the permutations of interest
can be obtained by function composition from rso_k and rsi_k.
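
The claim that every digit permutation is reachable can be checked mechanically for small k. The following is only an illustrative sketch: it assumes that rso_k acts on the digit positions as a cyclic rotation and rsi_k as a swap of the last two positions (as in the definitions above), uses 0-based positions, and the name `generated` is hypothetical.

```python
def generated(k):
    """Return the set of digit-position permutations obtainable by composing
    the rotation (rso_k) and the swap of the last two positions (rsi_k).
    Positions are 0-based; requires k >= 2."""
    rot = tuple(list(range(1, k)) + [0])              # s_2 ... s_k s_1
    swp = tuple(list(range(k - 2)) + [k - 1, k - 2])  # s_1 ... s_{k-2} s_k s_{k-1}
    start = tuple(range(k))
    seen = {start}
    frontier = [start]
    while frontier:
        p = frontier.pop()
        for g in (rot, swp):
            q = tuple(p[i] for i in g)  # compose p with the generator g
            if q not in seen:
                seen.add(q)
                frontier.append(q)
    return seen
```

For k = 3 and k = 4 the search reaches all k! orderings of the digit positions, which is the group-theoretic fact (a full cycle together with an adjacent transposition generates the symmetric group) behind the remark.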

When k=1, there are no non-trivial permutations and the only transformations are
broadcasts with c=1. When k=2, the outer shuffle and inner shuffle permutations are identical:
rso_2=rsi_2: Array_new[s_1 s_2] <- Array_old[s_2 s_1]

With the standard representation of a matrix as a vector, this corresponds to:

For 1≤i,j≤S_∞, rso_2=rsi_2: Array_new[S_∞*(i-1)+j] <- Array_old[S_∞*(j-1)+i].

It is not obvious how this can be computed within the given resource bounds. However,
gaps have been left in the data structure, so it is actually enough to compute:

For 1≤i,j≤Z_∞, rso_2=rsi_2: Array_new[Z_∞*(i-1)+j] <- Array_old[Z_∞*(j-1)+i].

This is achieved by the following algorithm (Z_∞ = 2^LOG_∞):
for(p=1; p<=LOG_∞; p++){ so_{2*LOG_∞}(Array_old, Array_new); }

To see why this works, note that each element of the tuple s can be written as a
LOG_∞-tuple of binary variables, so that s is equivalent to a (k*LOG_∞)-tuple of binary variables.
Such a tuple can be treated as a “node address” for an interconnection network, with
m=k*LOG_∞, which allows us to use the two primitives, so_m and si_m. Now the transformation
rso_2 can be seen to be simply so_{2*LOG_∞}^{LOG_∞}, i.e. so_{2*LOG_∞} iterated LOG_∞ times.
In the algorithm above, each of the LOG_∞ iterations computes so_{2*LOG_∞}. This generalises
to k>2, so that the following algorithm computes rso_k for k>1:

rso_k = so_{k*LOG_∞}^{LOG_∞}
To compute rsi_k, let inverse(so_{k*LOG_∞}) denote the inverse of so_{k*LOG_∞}:

inverse(so_{k*LOG_∞}) = so_{k*LOG_∞}^{(k*LOG_∞-1)}. (More efficient implementation is possible.)
Then

rsijt=SOi:.£,OGoo^^^'”[s 04 ,£Q(j 5 o(si]fc,£OffooSOit*£OGoo)^*^^°° IjiOGoo 

This is in postfix notation, i.e. first sOk»ioGoo is to be iterated LOGoo times, then 
®®jt«£,OGoo computed once, and so on. 

rso_2=rsi_2 is just a matrix transposition: the input is in row-major form and the output is
required in column-major form. For k>2 also, one may think of rso_k and rsi_k as transpositions
of multi-dimensional arrays.
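
A small sketch of the k=2 case may help. It assumes that the primitive so_m is the usual perfect (outer) shuffle, i.e. the element at m-bit address a moves to the address obtained by rotating a left by one bit; the function names are illustrative and indices are 0-based, unlike the 1-based indices in the text.

```python
def outer_shuffle(old, m):
    """One perfect shuffle of a list of 2**m elements: the element at m-bit
    address a moves to the address obtained by rotating a left by one bit."""
    n = 1 << m
    new = [None] * n
    for a in range(n):
        rot = ((a << 1) | (a >> (m - 1))) & (n - 1)
        new[rot] = old[a]
    return new


def relational_outer_shuffle_2(old, log_z):
    """Iterate the shuffle on (2*log_z)-bit addresses log_z times, which swaps
    the two log_z-bit digits of every address."""
    arr = list(old)
    for _ in range(log_z):
        arr = outer_shuffle(arr, 2 * log_z)
    return arr


def transpose_flat(old, z):
    """Reference transposition of a z-by-z matrix stored row-major."""
    return [old[j * z + i] for i in range(z) for j in range(z)]
```

With Z = 4 (so LOG = 2), two shuffles of a 16-element vector give the same result as transposing the 4-by-4 matrix directly.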

This covers the permutations. Next it is shown how to compute broadcasts. Two cases are
separately considered: those c-broadcasts in which c=1 and those c-broadcasts in which each
of the last c elements of s' is just s_k. The first type of broadcast is referred to as a calculation
and the second type of broadcast is referred to as a relational skew. It can be seen that
an arbitrary broadcast can be decomposed into a functional composition of permutations,
calculations and relational skews. Further, in view of the gaps in the data structure, it can
be seen that relational skew can in turn be decomposed into a functional composition of
permutations and primitive skews.

In a calculation, the last element of s is replaced by an essentially constant term or a
function of s_k to give s'. (An essentially constant term has a constant or a (bound) variable
over the BIN or the LOG domain where the binding is given as part of a suitable context. 
The predefined functions may also occur.) If the last element of s' is constant or essentially 
constant, in view of the gaps in the data structure, it can be decomposed into a functional 
composition of permutations and selects. 

If the last element of s' is a function of s_k (i.e. an element of the Herbrand Universe of
the (unbound) variable s_k and the (unary) predefined functions), then as s_k varies from S_1
to S_∞, the last element of s' selects regularly spaced subsequences of size S_∞^{k-1} for each
value of s_k. The calculation is achieved by interleaving these S_∞ subsequences to form a
sequence of length S_∞^k. In view of the gaps in the data structure, the algorithm actually
interleaves Z_∞ subsequences of length Z_∞^{k-1} each. There is a set of primitive algorithms in
terms of which each predefined function (TOBIN, FROMBIN, TOLOG, FROMLOG, SUC
and DOUBLE) and their functional compositions can be expressed.

The algorithm for the function SUC over the domain LIN is obtained in terms of the
primitive algorithm for suc_{c,m} and the masking and combine functions.

The algorithm for the function DOUBLE over the domain LIN is obtained in terms of the
primitive algorithm for db_{c,m} and the masking and combine functions.

The algorithm for the function FTLOG(.)=FROMLOG(TOLOG(.)) is obtained in terms
of the primitive algorithm for trlog_{c,m} and the masking and combine functions.

The algorithm for the function FTBIN(.)=FROMBIN(TOBIN(.)) is obtained in terms of
the primitive algorithm for tr2_{c,m} and the masking and combine functions.

4.5 Evaluation of FO+Y(T,S,V) Formulae on the DTM

Theorem 4.5.1: 

Let F be a formula with time domain T, space domain S, variance domain V and space-arity
k≥1 (such that T and S are not both LOG). Then for a constant m depending on F but
independent of the input size n, there is a DTM M operating in O(V_∞^m S_∞^k) workspace and
O(T_∞^m) mode transitions which recognizes the language expressed by F.

Proof: 

When the time domain is LIN, a naive simulation satisfies the constraints: since
S_∞≥LOG_∞ and k≥1, the workspace used for indexes, pointers, counters etc. is O(S_∞^k).
The only case needing detailed treatment is T=LOG, S=LIN, V∈{LOG,BIN}. W.l.o.g. it is
assumed that the given formula is in standard form (Y(LOG,LIN,V)[P,t,r,s](F)[v] with time
arity c, variance arity q and space arity k).

For constructing the predicate defined by the Y-operator there is a “main” worktape.
The current truth-values of the predicate P for all tuples r*s are laid out as YES/NO bits
(YES=1; NO=0) on this tape from left to right in lexicographic order of r*s, with r_1 being
most significant and s_k being least significant. When S_∞ is not a power of 2, gaps are left
in the data structure. This is done in the same manner as described in the previous section.
The truth-values P(r*s) are laid out at locations indexed by r*LIN2Z(s), using a total of
V_∞^q Z_∞^k rather than V_∞^q S_∞^k tape cells. Note, however, that V_∞^q Z_∞^k = O(V_∞^q S_∞^k).
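
The effect of the gaps can be illustrated with a short sketch. It assumes that Z_∞ is the smallest power of two that is at least S_∞ and that a space tuple is addressed digit by digit in base Z_∞; the helper names are hypothetical stand-ins (they are not the thesis's LIN2Z) and digits are 0-based.

```python
def padded_size(s_inf):
    """Smallest power of two that is >= s_inf (assumed meaning of Z_∞)."""
    z = 1
    while z < s_inf:
        z *= 2
    return z


def cell_index(s, s_inf):
    """Tape cell of the space tuple s, reading s as a base-Z number.
    Each digit must lie in range(s_inf); cells whose digits fall in
    [s_inf, Z) are the unused gaps."""
    z = padded_size(s_inf)
    idx = 0
    for digit in s:
        if not 0 <= digit < s_inf:
            raise ValueError("digit out of range")
        idx = idx * z + digit
    return idx
```

Because Z_∞ < 2*S_∞ under this assumption, a k-tuple layout uses fewer than 2^k * S_∞^k cells, which is the constant-factor overhead behind the O(...) bound stated above.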

Corresponding to each occurrence P(x) of the predicate in the formula F, there is an
auxiliary worktape. In general, x contains occurrences of the variables of t, r and s and
gets bound to specific ground tuples when t, r and s are given values. Let t be a tuple of
constants. The lexicographically ordered sequence of all the tuples r*s induces a sequence
of ground tuples x (possibly with multiple occurrences of some ground tuples). The truth
values of the predicate for this particular occurrence with this ordered sequence of ground
tuples x are laid out as YES/NO bits on this auxiliary tape (with gaps if necessary). Again
this is similar to the previous section, with the proviso that here terms over the variance
domain also occur in the tuple.

The predefined predicates INP and EQ are handled in a similar manner. For each
occurrence of these predicates with argument tuple x, an auxiliary worktape is used on which
the truth values of the predicate for the sequence of ground tuples x induced by the
lexicographically ordered sequence of all tuples r*s for a given tuple of constants t are stored.
O(V_∞^q S_∞^k) space is used on the main worktape and on each of the auxiliary worktapes for
this storage.

The computation consists of an initialisation phase, a construction phase of T_∞^c iterations
and a reporting phase. In the initialisation phase, the main worktape is written with NO
bits corresponding to an empty relation (everywhere false predicate). Each iteration of the
construction phase consists of an indexing stage and a commit stage. In each indexing stage
the contents of the auxiliary worktapes are updated to match the current contents of the
main worktape. (The iteration number fixes the value of t.)

In each commit stage, the heads of the main worktape and the auxiliary worktapes make
a single simultaneous sequential (left to right) sweep without any mode transitions. The
position of the head corresponds to the instantaneous value of the tuple r*s. Since the
formula is in standard form and the truth-values of the predicate occurrences are available
at the auxiliary head positions, the new truth-value of the predicate P for tuple r*s can be
computed in O(1) time, i.e. by the delta function of the DTM itself. This new value is
written on the main worktape at the current location of the head.

Finally, in the reporting phase, the final value of the predicate P constructed on the main
worktape is tested on the tuple v, and the DTM accepts iff this test succeeds. In pseudo-code,
this algorithm may be summarised as follows (M1, M2 and M3 are the number of occurrences
of P, INP and EQ respectively and pf is the propositional formula in F):

function(INP: predicate-over(S)): bool
tape-alloc P, PAux1, ..., PAuxM1,
           InpAux1, ..., InpAuxM2,
           EqAux1, ..., EqAuxM3: predicate-over(V^q S^k);
begin P <- {};
  for t = T_1^c to T_∞^c do
  begin PAux1 <- update-PAux1(P); ...; PAuxM1 <- update-PAuxM1(P);
        InpAux1 <- update-InpAux1(INP); ...; InpAuxM2 <- update-InpAuxM2(INP);
        EqAux1 <- update-EqAux1(); ...; EqAuxM3 <- update-EqAuxM3();
        P <- pf(PAux1, ..., PAuxM1, InpAux1, ..., InpAuxM2, EqAux1, ..., EqAuxM3);
  end;
  if (P(v)) return(TRUE); else return(FALSE);
end.
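
The three phases can also be mimicked in a few lines of ordinary code. This is a toy sketch under simplifying assumptions, not the DTM construction itself: tapes are Python lists, only occurrences of P are modelled (INP and EQ are omitted), each occurrence is given as a hypothetical index map from (t, position) to a cell of the main tape, and `pf` stands for the propositional formula.

```python
def evaluate(n_iters, n_cells, index_maps, pf, report_cell):
    """Toy version of the initialise / index / commit / report loop."""
    main = [False] * n_cells  # initialisation phase: empty relation
    for t in range(n_iters):  # construction phase
        # Indexing stage: one auxiliary list per occurrence of P, holding the
        # truth value that occurrence sees at every position of the sweep.
        aux = [[main[idx(t, pos)] for pos in range(n_cells)]
               for idx in index_maps]
        # Commit stage: one left-to-right sweep writes the new value of P.
        main = [pf(t, pos, [a[pos] for a in aux]) for pos in range(n_cells)]
    return main[report_cell]  # reporting phase


def demo(report_cell):
    """Example predicate: P(s) holds if s == 0 or P(s-1) held previously."""
    return evaluate(
        n_iters=3,
        n_cells=5,
        index_maps=[lambda t, pos: max(pos - 1, 0)],
        pf=lambda t, pos, vals: pos == 0 or vals[0],
        report_cell=report_cell,
    )
```

In the demo, each iteration extends the true prefix of the main tape by one cell, so after three iterations cells 0 to 2 are true and cell 3 is still false.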

To complete the proof it remains to show how the update of the auxiliary worktape 
contents in the indexing stage is achieved. This has to be done within the resource bounds 
claimed in the statement of the theorem. 

First the update of the auxiliary tapes for the predicate P is described. Let x be a
(q+k)-tuple that is part of an occurrence of P in F. As per assumption, the first q components
of x do not contain any occurrence of the variables of s. It is shown that the update of the
auxiliary worktape corresponding to this occurrence can be achieved in simultaneous
O(V_∞^q S_∞^k) space and poly(T_∞) mode transitions. The value of t is fixed during an update.

Let the last k components of x constitute the k-tuple x'. The update algorithm consists
of V_∞^q iterations, indexed by r. Each iteration has the values of t and r fixed, so the first q
components of x are essentially constant terms. The contiguous Z_∞^k-bit section of the main
worktape indexed by these q components is identified. As s varies lexicographically from S_1^k
to S_∞^k, x' addresses various locations in this section (during one iteration). The data at
these locations (in the order of access) have to be written on the auxiliary tape to complete
one iteration of the update algorithm. This constitutes a relational transformation in the
sense of the previous section and may be implemented as such.

This completes the description of the update of the auxiliary tapes for the predicate P. The 
update of the auxiliary tapes for the predicate INP is handled in exactly the same manner. 
The update of the auxiliary tapes for the predicate EQ is also handled in exactly the same 
manner when k>1. It remains to show how the update of the auxiliary tapes for the predicate
EQ is handled for the case k=1 to complete the proof of the theorem.

EQ is a binary predicate. Let k=1. Both the arguments of EQ are terms drawn from
the Herbrand Universe of constants, essentially constant terms, the (unbound) variable s_1
and the predefined functions. When both arguments are constant or essentially constant,
the update is carried out by writing Z_∞ copies of the truth-value of their comparison on the
auxiliary tape.

When one of the arguments is a constant or essentially constant term E and the other
is a term containing s_1, the update is carried out in three steps. First the auxiliary tape is
written with Z_∞ copies of FALSE(=0). Next the value TRUE(=1) is written at location
s_1=E. Finally, a calculation specified by the term containing s_1 is carried out to move this
truth-value to its desired position.

When k=1, and both arguments of EQ are terms containing s_1, there is only nearly linear
workspace, so the technique of separately computing the broadcasts for each variable of s
and then computing a skew cannot be used. Now use the fact that the (unary) functions
SUC, DOUBLE, FTLOG and FTBIN are non-decreasing. As noted in the previous section,
the terms containing s_1 can be written using these functions alone. Since the domain LIN is
finite, for each term E(s_1) there is a smallest element l∈LIN such that for all s_1≥l, E(s_1)=E(l).

First this element is identified for each of the two terms containing s_1. Let the two
elements so obtained be l_1 and l_2. W.l.o.g. assume l_1≤l_2. For l_2≤s_1≤LIN_∞, the approach of
comparing two essentially constant terms is used. For l_1≤s_1<l_2, the approach of comparing
a term containing s_1 with an essentially constant term is used. It remains to describe the
approach when LIN_1≤s_1<l_1.

In this range, all occurrences of FTLOG and FTBIN in the terms can be replaced
by the identity function and the function DOUBLE does not saturate. The resulting
simplified terms contain only SUC and DOUBLE and the occurrence EQ[E_1(s_1),E_2(s_1)]
describes a linear equation with one unknown s_1 which is restricted to be a natural
number less than l_1. It is clear that either this requirement is unsatisfiable (for
example, EQ[s_1,SUC(s_1)] is false everywhere in the range LIN_1≤s_1<l_1), or valid (for
example, EQ[SUC(SUC(DOUBLE(s_1))),DOUBLE(SUC(s_1))] is true everywhere in the range
LIN_1≤s_1<l_1) or has a unique solution l≥1 in the natural numbers which depends only on the
term and not the input. The update is easy after comparing this precomputed value l with
the input-dependent number l_1.
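
The three-way case split can be made concrete with a short sketch. It assumes that, in the non-saturating range, SUC(x)=x+1 and DOUBLE(x)=2x, so every term is an affine function a*s_1+b; terms are given as lists of function names applied innermost first, and the function names `affine` and `classify` are illustrative only.

```python
from fractions import Fraction


def affine(term):
    """Return (a, b) such that the term equals a*s_1 + b.
    term lists the applied functions innermost first."""
    a, b = 1, 0
    for f in term:
        if f == "SUC":
            b += 1
        elif f == "DOUBLE":
            a, b = 2 * a, 2 * b
        else:
            raise ValueError("only SUC and DOUBLE are allowed here")
    return a, b


def classify(term1, term2):
    """Classify EQ[term1(s_1), term2(s_1)] over the naturals s_1 >= 1."""
    a1, b1 = affine(term1)
    a2, b2 = affine(term2)
    if a1 == a2:
        return ("valid", None) if b1 == b2 else ("unsatisfiable", None)
    sol = Fraction(b2 - b1, a1 - a2)  # solve a1*s + b1 == a2*s + b2
    if sol.denominator == 1 and sol >= 1:
        return ("unique", int(sol))
    return ("unsatisfiable", None)
```

The two examples from the text come out as expected, and EQ[DOUBLE(s_1),SUC(SUC(s_1))] is an instance of the third case, with the single solution s_1=2. Whether that solution lies below l_1 still has to be checked against the input, as the text notes.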

This completes the description of the algorithms for updating the auxiliary worktape
contents in the indexing stage of the construction phase of the computation of DTM M on
input x. It can be seen that M operates in O(V_∞^m S_∞^k) workspace and O(T_∞^m) mode
transitions for some m independent of the input size n and recognizes the language expressed
by formula F, thus proving the theorem.

∎



Chapter 5 


The main result 


In chapter 2, the sequential and parallel models of computation and the resources of interest
were defined. Then, the resource bounded simulation of the DTM model on the PRAM
model was presented. In chapter 3, the calculus FO+Y(T,S,V) extending first order logic was
defined and a standard form was demonstrated. Then, the simulation of the PRAM model
in the calculus was presented. The evaluation of formulas in standard form on the DTM
model was presented in chapter 4. In this chapter, first, these results are combined to give
an equivalence between the DTM (with mode transitions) and the PRAM for conventional
resource bounds (modulo polynomial factors). Then this equivalence is extended to other
resource bounds. This equivalence holds for decision problems. The next section extends the
equivalence to function problems. This shows that the class NC can be characterised in terms
of mode transitions and workspace on the DTM. A unified invariance thesis and a conjecture
are stated. Finally, two corollaries relating this thesis and conjecture to the first and second
machine classes are proved.

5.1 Equivalence modulo polynomial factors for Conventional 
Resource Bounds 

In chapter 2, conventional resource bounds for the DTM were defined as follows:
Conventional workspace bounds s(n) and conventional mode transition bounds x(n)
are those of the form O(log(n)^k), O(n^k) and O*(n^k) (for some constant k≥1),
with the restriction that either x(n)=Ω(n) and s(n)=O(poly(x(n))) or s(n)=Ω(n) and
log(s(n))=O(poly(x(n)+log(n))).

The following two results were proved:

Theorem 5.1.1: 

For conventional workspace bound s() and conventional mode transition bound x(),
such that x(n)=Ω(n) and s(n)=O(poly(x(n))), let M be a DTM that uses at most s(n)
workspace and x(n) mode transitions on any input of length n and accepts language L.
Then there is a PRAM algorithm that uses at most h(n)=O(s(n)+log(n)) hardware and
t(n)=O(poly(x(n)+log(n))) parallel time and accepts L.

Theorem 5.1.2: 

For conventional workspace bound s() and conventional mode transition bound x(), such
that s(n)=Ω(n) and log(s(n))=O(poly(x(n)+log(n))), let M be a DTM that uses at most
s(n) workspace and x(n) mode transitions on any input of length n and accepts language
L. Then there is a PRAM algorithm that uses at most h(n)=O(s(n)+log(n)) hardware and
t(n)=O(poly(x(n)+log(n))) parallel time and accepts L.

In chapter 3, conventional resource bounds for the PRAM were defined as follows: 

Conventional hardware bounds h(n) and conventional parallel time bounds t(n)
are those of the form O(log(n)^k), O(n^k) and O*(n^k) (for some constant k≥1),
with the restriction that either t(n)=Ω(n) and h(n)=O(poly(t(n))) or h(n)=Ω(n) and
log(h(n))=O(poly(t(n)+log(n))).

The following two results were proved: 

Theorem 5.1.3: 

For T,S∈{LIN,LOG} (such that T and S are not both LOG) and V∈{LOG,BIN} (such that
V_∞≤S_∞ and V_∞≤T_∞), let A be a CROW
PRAM decision algorithm written using only READ, WRITE, ADD, SUBTRACT, MOVE,
INC/DEC, BIT and HALT instructions, which operates with O(poly(V_∞*S_∞)) hardware
and poly(T_∞) parallel time. Then, there exists a formula of FO+Y(T,S,V) which expresses
the language decided by A.

Theorem 5.1.4: 

Given any formula of the calculus FO+Y(T,S,V) (such that T and S are not both LOG),
one can construct an equivalent formula in standard form.

In chapter 4, the following result was proved: 

Theorem 5.1.5:

Let F be a formula with time domain T, space domain S, variance domain V and space-arity
k≥1 (such that T and S are not both LOG). Then, for some constant m depending on F
but independent of the input size n, there is a DTM M operating in O(V_∞^m S_∞^k) workspace
and O(T_∞^m) mode transitions which recognizes the language expressed by F.

Putting these results together, we obtain the following theorem: 

Theorem 5.1.6: 

For T,S∈{LIN,LOG} (such that T and S are not both LOG) and V∈{LOG,BIN} (such that
V_∞≤S_∞ and V_∞≤T_∞), the calculus FO+Y(T,S,V), the DTM restricted to O(poly(V_∞*S_∞))
workspace and poly(T_∞) mode transitions, and the PRAM restricted to O(poly(V_∞*S_∞))
hardware and poly(T_∞) parallel time, all recognise exactly the same class of languages.

5.2 Equivalence modulo polynomial factors for Acceptable 
Resource Bounds 

To generalise the result of the previous section to other resource bounds, some restriction on
the nature of resource bounds considered is necessary for the proof to go through.

Definition 5.2.1:

A simultaneous complexity model may be said to be a 3-tuple <M,T_m,H_m>, where
M is a model of computation and T_m and H_m are two complexity measures on M.
An <M,T_m,H_m>-computation is said to use at most (t,h) resources (equivalently,
work within resource bound (t,h)) if it uses at most t(n) units of the resource T_m
and at most h(n) units of the resource H_m on inputs of length n. A resource bound
(t,h) on <M,T_m,H_m> is said to be constructible on the simultaneous complexity
model if there is an <M,T_m,H_m>-machine which on every input of length n uses
at most (O(t),O(h)) resources and computes the string where z is the
binary encoding of t(n). It is said to be acceptable if it is constructible on the
simultaneous complexity model and one of the following two conditions holds:

(1) t=Ω(n) and h=O(poly(t)),

(2) h=Ω(n) and log(h)=O(poly(t+log(n))).

Let T_pram and H_pram denote the resources parallel time and hardware respectively on the
PRAM and let X_dtm and S_dtm be the resources mode transitions and workspace respectively
on the DTM. Then <PRAM,T_pram,H_pram> and <DTM,X_dtm,S_dtm> are simultaneous
complexity models. It is clear that a pair of conventional resource bounds satisfying the two
conditions above are constructible on the simultaneous complexity model and hence
acceptable. Let such a pair be called a conventional pair.

The results proved so far show that every <DTM,X_dtm,S_dtm>-computation taking (t,h)
resources for some conventional pair (t,h) can be simulated by a <PRAM,T_pram,H_pram>-computation
taking (poly(t+log(h)),O(h+log(t))) resources and every <PRAM,T_pram,H_pram>-computation
taking (t,h) resources for some conventional pair (t,h) can be simulated by a
<DTM,X_dtm,S_dtm>-computation taking (poly(t+log(h)),poly(h+log(t))) resources.

We would like to extend this from conventional resource bounds to all acceptable resource 
bounds. This requires the generalisation of each of the three simulations of chapters 2, 3 and 
4. These are considered in turn. 

The proofs of the theorems of chapter 2 simulating the <DTM,X_dtm,S_dtm> on the
<PRAM,T_pram,H_pram> carry through without any change for acceptable resource bounds.

To extend the result of chapter 3, four new domains HW, PT, LOGHW and LOGPT are
added to the structures along with corresponding SUC, DOUBLE and typecasting functions,
constants and predicate EQ. The domain HW by definition has a size equal to the hardware
bound and the domain PT by definition has a size equal to the parallel time bound. The
domains LOGHW and LOGPT by definition have sizes equal to the logarithm of the sizes of
domains HW and PT respectively. This gives us structures with seven domains, namely BIN,
LIN, LOG, HW, LOGHW, PT and LOGPT. The Y-operator is constrained to have time
domain T=PT, space domain S=HW and variance domain V∈{BIN,LOGHW}. With these
modifications, the normal form results as well as the expression of <PRAM,T_pram,H_pram>
computations in the calculus go through.

To simulate this new calculus on the <DTM,X_dtm,S_dtm>, the only extra work required
is to show that the resource bound is constructible on the simultaneous complexity model.
This, however, holds by assumption.

5.3 Extension from Decision Problems to Function Problems 

To handle functions on the <DTM,X_dtm,S_dtm>, one needs an output tape. As usual, the
output tape is one-way write-only and the space used on the output tape is not counted.
However, the mode transitions made on the output tape due to stops and starts are counted.

On the <PRAM,T_pram,H_pram>, one needs a write-only area in global memory. The
convention we follow is that negative addresses on global memory write operations refer to
this area. Since the convention for calculating number of global memory cells used refers to
the largest (most positive) address involved in a global memory operation, the locations used
for output are implicitly ignored. The convention for interpreting the contents of negative
memory as a string is as follows: The string is stored “backward”, i.e. the contents of
location at address -1 are the leftmost. The string is encoded in the binary alphabet {01,11}
and terminated by either 00 or 10. The contents of negative memory are assumed to be all 0s
initially. This convention ensures that unreasonably long output strings cannot be produced
by writing to locations with large negative addresses (since the first location left unaccessed
during the computation would then contain 00 or 10, thus terminating the string).
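
A small decoder illustrates why the convention bounds the output length. The sketch makes several assumptions that the text does not fix: negative memory is modelled as a dictionary from negative addresses to bits, each output symbol occupies two consecutive cells (the pair starting at address -1 is leftmost), the code 01 is read as the symbol 0 and 11 as the symbol 1, and the function name is hypothetical.

```python
def decode_output(neg_mem):
    """Decode the output string stored 'backward' in negative memory.
    Unwritten cells are read as 0, so the first untouched pair reads 00
    and terminates the string."""
    out = []
    addr = -1
    while True:
        first = neg_mem.get(addr, 0)
        second = neg_mem.get(addr - 1, 0)
        if second == 0:  # pair is 00 or 10: end of string
            break
        out.append(str(first))  # 01 -> '0', 11 -> '1' (assumed mapping)
        addr -= 2
    return "".join(out)
```

A stray write at a large negative address has no effect on the decoded string, because decoding stops at the first pair that was never written.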

In the calculus, we use free variables to encode the output string. An infinite sequence
of variable symbols is defined over each of the domains. The rank of the symbols within
the infinite sequence determines their relative significance, with a fixed order of precedence
for variables over different domains. With this convention, a formula with these as the only
free variables defines a predicate on each input structure, with a canonical lexicographic
ordering over tuples. The truth values of this predicate when laid out as bits according to
the lexicographic ordering of the tuples gives a binary string. This binary string is assumed
to encode an output string with the binary alphabet {01,11} terminated by either 00, 10 or
END-OF-STRING.

The simulation of a <DTM,X_dtm,S_dtm> computing a function on the <PRAM,T_pram,H_pram>
proceeds in the same way as in the case of language recognition. At the time of updating the
tape contents, the number of symbols written during the sweep is either 0 or equal to the
length of the sweep.

To express <PRAM,T_pram,H_pram> computations in the extended calculus, a number of
free variables which just suffices to encode the <PRAM,T_pram,H_pram> output are used to
build the formula. Since the output area is write-only, all the truth values required can be
expressed by a single formula.

The simulation of the calculus on the <DTM,X_dtm,S_dtm> is straightforward when the
number of mode transitions permitted is large enough to enable sequential simulation.
Otherwise, the output string is produced in a series of iterations, in each of which segments of
output are first obtained on a worktape, then transferred to the output tape. The length
of segments is limited by the workspace bound and the same workspace is reused in each
iteration. The decoding of the output string can be done in O(1) mode transitions for each
segment and the number of iterations is small.


5.4 Reducing the Complexity Overhead of Simulation 

These results show that when an algorithm A on one model that works within resource bound
(h(n),t(n)) is simulated on another model, the simulation S(A) works within resource bound
(O((h(n)+log(n))^k),O((t(n)+log(n))^k)) for some constant k. Here, k appears to depend on A.
However, it is easy to see that it can be made independent of A.

Theorem 5.4.1: 

Consider both the simulation of DTMs on PRAMs and the simulation of PRAMs on the
DTM. There exists an a priori constant k and a simulation method S such that for any
algorithm A that works within resource bound (h(n),t(n)), the simulation S(A) works within
resource bound (O((h(n)+log(n))^k),O((t(n)+log(n))^k)).

Proof: 

Each of the three simulations has to be considered in turn. The simulation of the
<DTM,X_dtm,S_dtm> on the <PRAM,T_pram,H_pram> goes through as presented in chapter
2, with a careful complexity analysis.

In chapter 3, the definition of the Y-operator has to be changed. Hitherto, we have allowed
a variance domain for space but none at all for time. Now we require the variance domain
for space to be BIN and introduce a variance domain for time, which should also be BIN, so
that a location in space is represented by a tuple which is an element of BIN^{k_1}*HW^{c_1} while
an instant of time is represented by a tuple which is an element of BIN^{k_2}*PT^{c_2}, where k_1,
k_2, c_1 and c_2 are constants. The normal forms can be achieved without changing k_2, and
permitting increase in c_2 only to handle nesting.

With these modifications, the increase in c_2 is bounded by the nesting depth of Y-operators
and first-order quantifiers and the number of occurrences of variables of domain HW in terms
of type BIN. The assumption is made that <PRAM,T_pram,H_pram> algorithms consist of a
single par-for-loop enclosed in a single seq-for-loop. This assumption is justified by recourse
to a replication theorem [Blelloch'sBook1990]. With this assumption and the modifications
mentioned above, we can show a constant k such that the formula in standard form
expressing a <PRAM,T_pram,H_pram> computation has space arity at most k times the exponent
of the hardware complexity of the <PRAM,T_pram,H_pram> algorithm and time arity only an
a priori fixed additive constant more than the exponent of the parallel time complexity of the
<PRAM,T_pram,H_pram> algorithm. (Immerman has shown a particular degree of blowup (i.e.
from c to 2c+2) to be sufficient for the hardware [Immerman1987b].)
The extension of the simulation of chapter 4 to the modified calculus is straightforward.
The constant k of the theorem is chosen as the maximum over all three simulations.

∎


5.5 A Unified Invariance Thesis and a Conjecture 

In [VanEmdeBoasHandbook1990], the classical Invariance Thesis and the Parallel
Computation Thesis were presented in the following terms:

Invariance Thesis: “Reasonable” machines can simulate each other within a polynomially
bounded overhead in time and a constant-factor overhead in space.

Parallel Computation Thesis: Whatever can be solved in polynomially bounded space on
a reasonable sequential machine can be solved in polynomially bounded time on a reasonable 
parallel machine and vice versa. 

The orthodox interpretation (both resource restrictions apply simultaneously) was adopted 
for the Invariance Thesis. The neutral term machine class was proposed to replace the 
evaluative term reasonable and the first and second machine class were defined as follows: 

The first machine class consists of those sequential models which satisfy the Invariance 
Thesis with respect to the traditional Turing machine model. 

The second machine class consists of those (parallel or sequential) devices which satisfy 
the Parallel Computation Thesis with respect to the traditional, sequential Turing machine 
model. 

We propose a Unified Invariance Thesis in the following terms: 

Definition 5.5.1: 

A machine model <M,T_m,H_m> is “reasonable” if every terminating
<M,T_m,H_m>-computation that works within (t(n),h(n)) resources on inputs of size n
works within (t(n),r(n)) resources for log(r(n))=O(t(n)+log(n)) and if every
<M,T_m,H_m>-computation taking (t,h) resources for some acceptable pair of
resource bounds (t,h) can be simulated by a <DTM,X_dtm,S_dtm>-computation
taking (poly(t+log(h)),O(h+log(t))) resources and every <DTM,X_dtm,S_dtm>-computation
taking (t,h) resources for some acceptable pair of resource
bounds (t,h) can be simulated by an <M,T_m,H_m>-computation taking
(poly(t+log(h)),O(h+log(t))) resources.

The following property holds for any “reasonable” model: 

Theorem 5.5.2: 

Let <M,T_m,H_m> be any “reasonable” simultaneous complexity model. Define the
resource R_m as R_m=T_m*(H_m+n), where n is the size of the input. Then the simultaneous
complexity model <M,R_m,H_m> belongs to the first machine class. In particular, letting
R_dtm=X_dtm*(S_dtm+n), <DTM,R_dtm,S_dtm> belongs to the first machine class.

Proof: 

Again, we first prove the case for <DTM,X_dtm,S_dtm> and then use the “reasonability” 
of the model in question to prove the general case. Since the resource workspace is common 
to the (classical) Invariance Thesis and the <DTM,X_dtm,S_dtm>-equivalent of the Unified 
Invariance Thesis, the only question is the relationship between runtime and the product of 
mode transitions and the sum of workspace and input size. It has been shown in chapter 2 that 
the product of mode transitions and the sum of workspace and input size is an upper bound 
for runtime. On the other hand, both the workspace and the number of mode transitions are 
bounded by the runtime. As per the (classical) Invariance Thesis, only those computations 
that read the entire input (and hence take Ω(n) runtime) are relevant. Thus the square 
of the runtime is an upper bound (modulo constant multiplicative and additive factors) for 
the product of mode transitions and the sum of workspace and input size. This shows the 
polynomial equivalence of runtime and the new resource R_dtm, and the simultaneous linear 
equivalence of workspace with itself, thus proving that the simultaneous complexity model 
<DTM,R_dtm,S_dtm> belongs to the first machine class. It only remains to note that the 
“reasonability” of the simultaneous complexity model <M,T_m,H_m> together with the above 
result proves the theorem in the general case. 
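
The two bounds used in this argument can be written out as a short worked derivation. The 
symbols T, X, S and R are shorthand introduced here (they are not the notation of the proof) 
for runtime, mode transitions, workspace and the product X*(S+n) on inputs of size n. 

```latex
\begin{align*}
T &= O\bigl(X\,(S+n)\bigr) = O(R)
  && \text{(upper bound on runtime from chapter 2)}\\
R &= X\,(S+n) \;\le\; T\,(T+n) = O(T^{2})
  && \text{(since } X \le T,\ S \le T,\ n = O(T)\text{)}
\end{align*}
```

Together these give T = O(R) and R = O(T^2), which is the polynomial equivalence claimed. 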

∎ 

We propose a conjecture in the following terms: 

Conjecture 5.5.3: 

<PRAM,T_pram,H_pram> is a “reasonable” simultaneous complexity model in the sense of 
the Unified Invariance Thesis. 

The following property holds for any “reasonable” model if the conjecture is true: 

Theorem 5.5.4: If the conjecture is true, then every “reasonable” simultaneous complexity 
model <M,T_m,H_m> belongs to the second machine class. In particular, <DTM,X_dtm,S_dtm> 
belongs to the second machine class. 

Proof: 

It has to be shown only that parallel time is polynomially equivalent to H_m, without 
any restriction on the other resource, namely hardware or T_m. But for <DTM,X_dtm,S_dtm> 
this holds by virtue of the classical proof. So it is enough to prove that parallel time is 
polynomially equivalent to S_dtm and then use the “reasonability” of the <DTM,X_dtm,S_dtm> 
and the <M,T_m,H_m> models in the sense of the Unified Invariance Thesis to show that S_dtm 
is linearly equivalent to H_m. Note that the Parallel Computation Thesis requires the 
equivalence to hold only for resource bounds that are Ω(log(n)). 

∎ 



Chapter 6 


Concluding Remarks 


6.1 The Meaning of “Mode” 

It is traditional to view a sequential computation as a history, and a parallel algorithm is 
obtained when this history is partitioned into a small number of large subsequences that 
are characterised by internal parallelizability and differentiated from each other by external 
sequentiality. Thus, the major task is the establishment of discontinuity. 

But “establishing discontinuities (in history) is not an easy task . . .We may wish to draw 
a dividing-line; but any limit we set may perhaps be no more than an arbitrary division 
made in a constantly mobile whole. We may wish to mark off a period,. . .(but) where, in 
that case, would the cause of its existence lie? Or that of its subsequent disappearance 
and fall? What rule could it be obeying by both its existence and its disappearance? If it 
contains a principle of coherence within itself, whence could come the foreign element capable 
of rebutting it?. . .(Discontinuity) begins with an erosion from outside, from that space which 
is, for thought, on the other side, but in which it has never ceased to think from the very 
beginning” [Foucault1970]. 

By choosing representations to characterise parallelizable periods of history a sequential 
computation is inverted to form a chain of organic structures; sequential time is transformed 
from a linear sequence of basic elements to a linear sequence of well-structured subsequences. 
The “mode” of a tape represents an assumption of obliviousness that is not necessarily 
warranted, though it may hold for long periods of time before breaking down. It implies a 
separation of automaton state and behaviour from the irreducible richness of tape data and 
makes possible a concise and easy-to-compute predictive model that permits the simulator to 
eliminate sequential dependence within a sweep. Since the actual tape contents are ignored 
by the lookahead, an apparent discontinuity occurs when the delta function “behaves 
inconsistently”. It is important to note the illusory nature of this inconsistency: the delta 
function does not even know that it has a mode; the discontinuity lies in the eye of the historian. 
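
The data-oblivious nature of the prediction rule can be illustrated with a small sketch. It 
follows the earlier description of a mode as the pattern of the most recent head movements, 
with a movement being expected when it cyclically repeats that pattern. The function name 
`count_unexpected_moves`, the string encoding of movements and the choice to count every 
unexpected movement separately are assumptions of this sketch, for a single tape only. 

```python
def count_unexpected_moves(moves, period):
    """Count head movements that break the cyclic pattern of the last `period` moves.

    `moves` is a sequence of symbols such as "L" and "R" (a hypothetical encoding).
    A move is treated as expected when it equals the move made `period` steps
    earlier, i.e. when it cyclically repeats the most recent pattern.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    unexpected = 0
    # The first `period` moves establish the initial pattern and are not judged.
    for i in range(period, len(moves)):
        if moves[i] != moves[i - period]:
            unexpected += 1
    return unexpected


if __name__ == "__main__":
    # One long sweep right followed by one long sweep left: a single break.
    print(count_unexpected_moves("RRRRLLLL", 1))
    # A strictly alternating head is perfectly predictable with period 2.
    print(count_unexpected_moves("RLRLRLRL", 2))
```

Note that the rule never inspects tape contents: the prediction depends only on the recent 
movement history, which is what allows the simulator to look ahead within a sweep. 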

The set of possible modes represents the ability of the simulator to theorise about patterns 
in sequential history. As such, it is interesting to question whether, and to what extent, the 
richness of this theory limits the ability to extract parallelism. The results presented in 
this work indicate that performance optimal modulo polylog factors can be obtained with a 
constant-memory real-time data-oblivious prediction rule. Can one do better by considering 
more complex rules? 

The intuitive notion of “predictive model” was mentioned above. The essence of prediction 
is avoiding the explosion of possible futures. This rough idea is used in theoretical computer 
science in at least three contexts including the present one: predictability could mean com- 
pressibility or determinism or parallelizability while unpredictability may mean randomness 
or alternation or inherent sequentiality. The relationship of parallelism to alternation is 
well-known. Is there a relationship between parallelism and compression? 

6.2 The Role of Descriptive Complexity 

The role of descriptive complexity as a “high level programming language” and as a method of 
simplifying the “global memory access pattern” has been discussed in chapter 1 and chapter 
3. Two further points need to be mentioned. First, the representation of resource bounds 
as syntactic attributes resulted in the elimination of the proof of efficiency as an entity 
distinct and separate from the proof of correctness of the algorithm which simulates PRAMs 
on the TM. This may be contrasted with the proof (in chapter 2) of the algorithm which 
simulates TMs on the PRAM. Secondly, the reduction to standard form may be viewed as a 
“Replication Theorem” [Blelloch’sBook1990]; the importance of such a theorem for practical 
parallel computing is well recognised. This suggests that an important research area for 
descriptive complexity might be the development of parallel programming languages that are 
designed to facilitate such a theorem. 

6.3 Functions Computed by Communication Networks 

That the problem of processing first order queries on relational databases is equivalent to 
the problem of communicating data between processors in a parallel computing system has 
been dramatically demonstrated in chapter 4 by using interconnection network techniques to 
implement relational transformations efficiently. While the concerns of computer engineers in 
designing such networks are of course far more elaborate, this work backs up the argument of 
[PippengerHandbook1990] that upper and lower bound results for computational problems 
might be obtained from corresponding bounds for communication networks even when the 
computational problems and the means used to solve them involve activities other than 
communication. Our choice of the term “basic redistributive functions” is intended to suggest 
that these may form the base functions in a hierarchy of functions similar to the primitive 
recursive functions. 

6.4 The Correspondence between External and Parallel Algorithms 

An amusing idea that is suggested by these results is that for decades researchers might 
have been doing research in parallel computing without being aware of the fact. The devel- 
opment of massive core memories is a recent phenomenon; the very word “core” refers to 
electric windings with cores of magnetic materials that used hysteresis to store a bit of data. 
Databases were (and still are) stored on tape spools and access to the data was sequential. 
As a result, researchers concentrated on development of “file organisations” and “update 
algorithms” that simultaneously satisfied two constraints: the amount of tape used was to be 
kept low and the number of tape start, stop and rewind operations was to be kept low. The 
great practical importance of these “external algorithms” has ensured that they remain in 
the textbooks even though most algorithm designers dismiss them as obsolete. 

Now it is apparent that not only are they not obsolete; they are promising candidates for 
good parallel algorithms! Our results give a precise formal sense in which the correspondence 
between external and parallel algorithms holds. It is interesting that heapsort is as difficult 
to parallelise as it is to externalise, while mergesort has both external and parallel versions. 
The problem of designing a nearly linear cost NC algorithm for transitive closure is open; 
path finding in graphs is notoriously a difficult problem in database applications. Database 
researchers have traditionally been critical of algorithms designed to be efficient on the RAM. 
There was once an expectation that “file theory” would evolve as a separate area of research; 
unfortunately the lack of uniformity in external storage architectures prevented this. It is 
thus possible that the appropriate complexity theory for databases is parallel theory rather 
than sequential theory; good file organizations correspond to good parallel data structures 
and good external update algorithms correspond to good parallel update algorithms in a 
uniform way. This would mean that database researchers are in their own way experts in 
parallel computing whether they know it or not! 
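
As an illustration of the kind of external algorithm meant here, the following is a minimal 
sketch of a bottom-up two-way merge sort that counts its passes over the data. Python lists 
stand in for tape files and `heapq.merge` stands in for a sequential merge of two sorted runs; 
the function name `external_merge_sort` and the use of the pass count as a rough proxy for 
tape starts, stops and rewinds are assumptions of this sketch, not part of the formal results. 

```python
import heapq


def external_merge_sort(records):
    """Sort by repeated sequential merging of runs; return (sorted_list, passes).

    Every pass reads each run once from front to back, so the access pattern
    within a pass is purely sequential, as it would be on tape.
    """
    runs = [[r] for r in records]  # initial runs of length one
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                # Two-way merge of neighbouring runs, consuming both in order.
                merged.append(list(heapq.merge(runs[i], runs[i + 1])))
            else:
                merged.append(runs[i])  # odd run out is carried over unchanged
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    ordered, passes = external_merge_sort([5, 3, 8, 1, 9, 2, 7])
    print(ordered, passes)
```

The number of passes grows only logarithmically with the number of records, while the 
storage used stays linear, which is the combination of constraints described above. 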


6.5 Functional Dependencies and Recursive Query Processing 

Could the paradigms of parallel algorithm design find application in database query process- 
ing? An example is evaluation of recursive queries. Since a directed graph with vertices 
restricted to outdegree 1 may be thought of as a functional dependency, and very efficient 
parallel algorithms for finding least common ancestors in such graphs are known [Tsin1986], 
this seems possible at least in some special cases. 

Recursive database queries can be defined by datalog programs or by fixpoint equations, 
both of which extend relational calculus. In the most general framework, such queries have 
high complexity. Therefore, it is important to recognize commonly occurring subclasses of 
recursions that can be evaluated efficiently using algorithms specially tailored to these 
subclasses. Besides the case in which complete materialisation of recursive relations is 
required, it is often the case that in the query some selection is applied to the recursive 
relation. In general, it is not possible to transpose the selection and recursion operations, 
and complete materialisation of the recursive relation becomes necessary. In 
[SippuSoisalon-Soininen1988], a class of generalised binary composition operators was 
proposed. Operators in this class are defined in terms of relational operations, are 
associative and have the property that the transitive closure of a first normal form (1NF) 
relation R with respect to such an operator can be computed in polynomial time by an 
algorithm which performs only O(polylog(|R|)) file operations (with a suitable definition of 
“file operation”: for example, an index or sort command on a file of size n counts as 
O(log(n)) file operations, while a sequence of next-record commands or a sequence of 
previous-record commands or a rewind command counts as one file operation). However, when 
a selection is applied to the closure, apparently either the entire relation has to be 
materialised, or the number of file operations cannot be restricted to O(polylog(|R|)). 

A selection query on the generalised transitive closure is defined as follows 
[SippuSoisalon-Soininen1988]: Let the attribute columns of relations be ordered so that one 
can use “$1” (“$2”, “$3” etc.) to refer to the 1st (respectively 2nd, 3rd etc.) component of 
a tuple while writing selections and projections. Assume selection predicates F are such that 
given an operand $i of F, F may be written as (F_1 ∧ F_2), where F_1 contains all relational 
predicates involving $i and none of the others, and F_1 is not always true. Let 
D = D_1 × D_2 × ... × D_k, k ≥ 1, be a finite product domain and let g be a binary operator 
on D; i.e. g maps pairs of relations over D into relations over D. g is a generalised 
composition if for all R_1 and R_2 over D, 
g(R_1,R_2) = π_{$i_1,$i_2,...,$i_k}(σ_F(R_1 × R_2)) and the following conditions hold: 

(a) For all 1 ≤ j ≤ k, either i_j = j or i_j = k+j. 

(b) For all j, 1 ≤ j ≤ k, if $i_j occurs in the selection predicate F, then F should be such 
that for the tuples selected by F, $j = $(k+j) is true. 
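
The shape of a generalised composition can be made concrete with a small sketch. Relations 
are modelled as Python sets of k-tuples, condition (a) is encoded by a hypothetical `take` 
vector saying, for each output column j, whether it is copied from the first or the second 
operand, and the selection predicate F is passed in as a hypothetical `predicate` function. 
Condition (b) is not checked by the code and is left to the caller; the nested-loop product 
is chosen for clarity, not efficiency. 

```python
from itertools import product


def generalised_composition(r1, r2, take, predicate):
    """Evaluate a projection of a selection over the product r1 x r2.

    take[j] == 1 copies column j from the r1 tuple (the case i_j = j);
    take[j] == 2 copies column j from the r2 tuple (the case i_j = k + j).
    predicate(t1, t2) plays the role of the selection predicate F.
    """
    if any(side not in (1, 2) for side in take):
        raise ValueError("each entry of take must be 1 or 2")
    result = set()
    for t1, t2 in product(r1, r2):
        if predicate(t1, t2):
            result.add(tuple(t1[j] if side == 1 else t2[j]
                             for j, side in enumerate(take)))
    return result


if __name__ == "__main__":
    # Ordinary composition of binary relations as a special case with k = 2:
    # keep column 1 of the first tuple and column 2 of the second tuple.
    edges_a = {(1, 2), (2, 3)}
    edges_b = {(2, 3), (3, 4)}
    print(generalised_composition(edges_a, edges_b, take=(1, 2),
                                  predicate=lambda t1, t2: t1[1] == t2[0]))
```

In this special case neither projected column occurs in the predicate, so condition (b) is 
satisfied trivially. 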

Let R^+ denote the least fixpoint of the equation X = g(X,R) ∪ R. Let B be a subset of R. 
The first selection query on R with respect to B is defined as 
R^+ ∩ π_{$i_1,$i_2,...,$i_k}(B × R), and the second selection query is defined as 
R^+ ∩ π_{$i_1,$i_2,...,$i_k}(R × B). 

Since such selection queries occur frequently, it is worth exploring whether they can be 
evaluated more efficiently in special cases. Here we define a generalised functional dependency 
and give algorithms for the two selection queries which always work correctly, always 
perform only O(polylog(|R|)) file operations and are very efficient when the dependency holds. 
Specifically, when the dependency holds, they work in time O(|R|polylog(|R|)+|B|*|R|). In 
the worst case they have the same time complexity as the algorithm for materialising the 
entire closure. Let a relation R over domain D be said to satisfy a generalised functional 
dependency with respect to a generalised composition operator g (denoted gfd(D,g,R)) if 
for every subset B of R and for every relation R' over D, g(R',B) can be computed in 
O((|B|+|R'|)polylog(|B|+|R'|)) time using O(polylog(|B|+|R'|)) file operations and the 
number of tuples in g(R',B) is no more than the number of tuples in R'. 

Example 1: Consider a personnel database containing a relation PERSONNEL with at- 
tributes EMP-ID, ALMA-MATER and MNGR-ID representing respectively the identification 
code of an EMPLOYEE, the name of the UNIVERSITY at which the employee studied last 
and the identification code of the employee’s immediate supervisor. An employee has at most 
one immediate supervisor. All supervisors are employees. Given a relation OLDBOYS with 
attribute EMP-ID containing the identification codes of a set of employees, it is required to 
find all pairs (X,Y) such that X is the id of an employee, Y is the id of an employee who is 
directly or indirectly a manager of X, all employees in the chain of command from Y to X 
including Y and X are from the same university and Y occurs in relation OLDBOYS. 

Let D be the domain of relation PERSONNEL and let g be the generalised composition 
operator 

g(R_1,R_2) = π_{R_1.EMP-ID, R_1.ALMA-MATER, R_2.MNGR-ID} 
    (σ_{R_2.EMP-ID=R_1.MNGR-ID ∧ R_2.ALMA-MATER=R_1.ALMA-MATER}(R_1 × R_2)). 

It can be seen that gfd(D,g,PERSONNEL) holds. Let B be the result of the query 

B = g(PERSONNEL, 
    π_{PERSONNEL.EMP-ID, PERSONNEL.ALMA-MATER, PERSONNEL.MNGR-ID} 
    (σ_{PERSONNEL.EMP-ID=OLDBOYS.EMP-ID}(OLDBOYS × PERSONNEL))). 

Let PERSONNEL_B be a selection query as defined above. Then the required output can 
be obtained as π_{EMP-ID,MNGR-ID}(PERSONNEL_B). 

Example 2: Consider a personnel database containing a relation PERSONNEL with at- 
tributes EMP-ID, CADRE and MNGR-ID representing respectively the identification code 
of an EMPLOYEE, the cadre of the employee and the identification code of the employee’s 
immediate supervisor. An employee has at most one immediate supervisor. All supervisors 
are employees. The cadre is either LABOUR or MANAGEMENT. Each employee who has a 
supervisor is directly or indirectly managed by an employee of the MANAGEMENT cadre. 
For all employees who have a supervisor, the juniormost employee of the MANAGEMENT 
cadre who manages the employee writes the employee’s Annual Confidential Report. Given 
a relation TROUBLEMAKERS with attribute EMP-ID containing the identification codes 
of a set of employees, each having a supervisor, it is required to find all pairs (X,Y) such that 
X is the id of a troublemaker and Y is the id of the employee who writes the A.C.R. of X. 

Let D be the domain of relation PERSONNEL and let g be the generalised composition 
operator 

g(R_1,R_2) = π_{R_1.EMP-ID, R_1.CADRE, R_2.MNGR-ID} 
    (σ_{R_2.EMP-ID=R_1.MNGR-ID ∧ R_2.CADRE≠“MANAGEMENT”}(R_1 × R_2)). 

It can be seen that gfd(D,g,PERSONNEL) holds. Let B be the result of the query 

B = π_{PERSONNEL.EMP-ID, PERSONNEL.CADRE, PERSONNEL.MNGR-ID} 
    (σ_{PERSONNEL.EMP-ID=TROUBLEMAKERS.EMP-ID}(TROUBLEMAKERS × PERSONNEL)). 

Let R = PERSONNEL_B be a selection query as defined above. Then the required output 
can be obtained as 

π_{R.EMP-ID, R.MNGR-ID} 
    (σ_{PERSONNEL.EMP-ID=R.MNGR-ID ∧ PERSONNEL.CADRE=“MANAGEMENT”}(R × PERSONNEL)). 

The following algorithm evaluates the first selection query efficiently when the relation 
R satisfies a generalised functional dependency gfd(D,g,R) (A_1 and A_2 are intermediate 
relations and R_B holds the result): 

begin 
    l := precompute an upper bound (say (k*|R|)^k for 1NF relations R with k attribute 
         columns) on the lengths of shortest derivations for those tuples for which a 
         derivation exists; 
    A_1 := A_2 := R; 
    for ⌈log_2(l)⌉ iterations do 
    begin 
        A_1 := g(A_1,A_1); 
        A_2 := A_2 ∪ A_1; 
    end; 
    R_B := A_1 := B; 
    while (A_1 ≠ ∅) do 
    begin 
        A_1 := g(A_1,A_2) - A_1; 
        R_B := R_B ∪ A_1; 
    end; 
end. 

We call A_2 a temporal index for R to distinguish it from ordinary indexes which we call 
spatial indexes. 

The following algorithm evaluates the second selection query efficiently when the relation 
R satisfies a generalised functional dependency gfd(D,g,R) (A_1 and A_2 are intermediate 
relations and R_B holds the result): 

begin 
    l := precompute an upper bound (say (k*|R|)^k for 1NF relations R with k attribute 
         columns) on the lengths of shortest derivations for those tuples for which a 
         derivation exists; 
    A_1 := A_2 := R; 
    for ⌈log_2(l)⌉ iterations do 
    begin 
        A_1 := g(A_1,A_1); 
        A_2 := A_2 ∪ A_1; 
    end; 
    R_B := A_1 := B; 
    while (A_1 ≠ ∅) do 
    begin 
        A_1 := g(A_2,A_1) - A_1; 
        R_B := R_B ∪ A_1; 
    end; 
end. 

Note that these algorithms work correctly with cyclic data. 
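
The temporal-index idea can be illustrated with a small Python sketch of the first algorithm. 
It is a sketch under stated assumptions, not the algorithm as analysed above: relations are 
in-memory sets of pairs (so k = 2), the hypothetical helper `compose` is ordinary relational 
composition standing in for g, and file operations are not modelled. The frontier subtracts 
every tuple already found rather than only the previous frontier; this is a choice made in 
the sketch so that termination on cyclic data is easy to see. 

```python
import math


def compose(r1, r2):
    """Ordinary composition: (a, c) whenever (a, b) is in r1 and (b, c) is in r2."""
    successors = {}
    for b, c in r2:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in successors.get(b, ())}


def selection_closure(r, b, g=compose, k=2):
    """Collect the closure tuples derivable from the seed relation b.

    Phase 1 builds the temporal index a2 = r + r^2 + r^4 + ... by repeated
    squaring. Phase 2 extends the frontier a1 through the index until no new
    tuple appears.
    """
    bound = max(2, (k * len(r)) ** k)  # upper bound on derivation length
    a1 = set(r)
    a2 = set(r)
    for _ in range(math.ceil(math.log2(bound))):
        a1 = g(a1, a1)  # derivations of exactly twice the previous length
        a2 |= a1
    result = set(b)
    a1 = set(b)
    while a1:
        a1 = g(a1, a2) - result  # keep only tuples not seen before
        result |= a1
    return result


if __name__ == "__main__":
    # (employee, supervisor) pairs; every employee has at most one supervisor.
    personnel = {(1, 2), (2, 3), (3, 4), (5, 4)}
    print(sorted(selection_closure(personnel, {(1, 2)})))
```

Because the index already contains jumps of length 1, 2, 4 and so on, the second loop needs 
far fewer rounds than a step-by-step traversal of a long chain would. 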

6.6 Query Language Extensions 

Let INC_dtm be the class of problems solvable on DTMs using nearly linear workspace and 
polylog mode transitions. Let INCF_dtm be the corresponding function class. Similarly, let 
INC_pram be the class of problems solvable on PRAMs using nearly linear hardware and 
polylog parallel time. Let INCF_pram be the corresponding function class. It is clear that 
the classes defined on the DTM are subsumed by the corresponding classes defined on the 
PRAM. If the conjecture of chapter 5 is true, this inclusion is an identity and algorithms in 
this class are efficiently portable to all “reasonable” machine models. The classes INCF_dtm 
and INCF_pram, along with LOGSPACEF, might prove useful while defining standards 
(completeness properties) for next generation query languages. While the need for extending 
the expressive power of query languages has been commented upon frequently, concern has 
focused on the fact that even quadratic complexity is unacceptable for database applications. 
These classes contain many problems of database interest not expressible in relational calculus 
(order statistics, categorical aggregation, path-finding in rooted trees) and calculi expressing 
these classes may prove to have the appropriate amount of expressive power for database 
applications. (Principles of independent interest like the Consistency Criterion would 
restrict the expressive power further.) This would also promote efficient portability to 
parallel environments. 

Substantial research has been done already in the database field on new query languages 
that are more expressive than relationally complete languages. One of the most interesting of 
these, QBE, is very expressive, but is usually not implemented in full because it is too costly to 
do so. With hindsight, we can “explain” this by pointing out that QBE can express transitive 
closure of graphs, a query which is not known to be in either INCF_pram or LOGSPACEF. (The 
best known NC algorithm, based on Boolean Matrix Multiplication, has O*(m^2.5) = O*(n^1.25) 
cost on graphs of m vertices, where the input, in adjacency matrix representation, is of size 
n = m^2. To qualify for inclusion in INCF_pram, this must be reduced to O*(m^2) = O*(n^1). See 
[CoppersmithWinograd1982].) 
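
The conversion between the two ways of stating the cost is a one-line calculation. In the 
worked equation below, c is a symbol introduced here for the exponent of the cost in terms of 
the number of vertices m; it is not notation used elsewhere in this work. 

```latex
\[
n = m^{2}
\quad\Longrightarrow\quad
m^{c} = \bigl(n^{1/2}\bigr)^{c} = n^{c/2},
\qquad\text{so a cost of } O^{*}(m^{c}) \text{ is nearly linear in } n
\text{ only if } c \le 2 .
\]
```

An exponent above 2 in m therefore always leaves a polynomial gap above nearly linear cost in 
the input size n. 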

The calculi developed in this study can in principle express queries which output relations 
(by leaving some variables unbound). However, relations are only indirectly represented in 
this way, and it would be difficult to interface the retrieval language to the other parts of 
a practical query language (for example, creation, indexing and update of persistent data). 
Also, the strong assumption of order that is made here for the purpose of proving the 
complexity result is unacceptable for real query languages. It is therefore desirable to develop 
equivalent (and far more user-friendly) calculi that name and manipulate relations in the 
style of SQL, QUEL and QBE. It should be possible to assign intensionally defined relations 
to relation names within the scope of an iteration operator. 

6.7 Suggestions for Future Work 


On the descriptive complexity front, the following issues arise. 

The power of FO+Y(LOG,LOG,BIN) seems difficult to characterize. Is the calculus capable 
of expressing PARITY and MAJORITY? It is clear that the class it expresses is subsumed 
by SC ∩ NC and strictly subsumed by Simultaneous-PolyTime-LinSpace. 

The work of Abiteboul and Vianu [AbiteboulVianu1991b] on generic complexity shows 
that the complexity of a query can grow wildly when total ordering is removed. It would 
be interesting to know which queries remain expressible in the calculus FO+Y(T,S,V) when 
some or all of the machinery is removed (for example, one could remove some/all of the 
DOUBLE function, the SUC function, the BIN domain and the LOG domain) or replaced 
(say, by using the BIT predicate in place of the DOUBLE function). 

Two invariance notions on space bounding functions, corresponding to (S=LIN,V=LOG) 
and (S∈{LIN,LOG},V=BIN), have been considered. Another invariance notion may be 
denoted by ignoring the space arity. It may be noted that other notions, like being within a 
poly(log*(.)) factor of equality or being within a poly(alpha(.)) factor of equality (alpha(.) 
is the inverse Ackermann function), that seem to occur in recently published algorithms can 
be expressed by introducing additional piggyback domains as in chapter 5. 

Characterization of calculi restricted in nesting-depth*space-arity product or in 
nesting-depth*time-arity product, more elegant operators equivalent to the Y-operator but 
with lesser primitive machinery and the extension of this explicit representation approach to 
other classes are other possibilities for study. 

On the structural complexity front, our classes are uniform. Can similar non-uniform 
(P-uniform) classes be defined and characterised? What relationships can be adduced between 
NC^1 (or LOGSPACE, which subsumes it) and INC_dtm (or INC_pram, which subsumes it)? 

On the classical complexity front, the following issue arises: 

The results presented in this work, which arose from an attempt to unify the Invariance 
Thesis, the Parallel Computation Thesis, Pippenger’s simulation result and the work on 
“efficient” simulations between parallel models, are part of a (possibly hopeless) programme 
to recover the uniform, unfragmented perspective which computational complexity theory 
enjoyed in the 1960s, because of which it was recognised both as a branch of mathematics and 
as a branch of engineering. An obvious next step is to incorporate the distinction between 
computing elements (feedforward) and persistent memory (feedback) and study serial case 
parallelism as triple resource bounded complexity. Does there exist a variant of Turing 
machines capable of handling the serial case? See also the Pipelined Computation Thesis of 
[Wiedermann1992]. 

On the recursive query processing front, the notion of using (variants of) functional 
dependency constraints for efficient query evaluation can be extended to different subclasses 
of datalog programs. Algorithms could be designed for use only when the constraint is known 
to hold or they could be designed to be always correct and to work more efficiently when the 
constraint happens to hold. 

On the query language extensions front, the complexity of database query language fea- 
tures needs to be studied. Since a total ordering is in general not available in end-user query 
languages, a collection of special-purpose operators like the unique and group by operators of 
SQL, aggregation functions, etc. are used instead. Studying the expressiveness/complexity 
trade-off of these operators, in isolation and in combination with iteration operators might 
lead to the development of more expressive yet practical query languages. 

While the need for fast business computing platforms is growing rapidly, parallel computing 
platforms are suffering from a lack of non-scientific software applications. As database 
files often are organised and COBOL programs often are written quite deliberately to use 
nearly linear amounts of tape and to minimize the number of stops and starts by the tape 
drive, further work might lead to a way of porting the large established base of business 
computing software to parallel computing environments (by detecting parallelism in COBOL 
programs in the spirit of these results) with nearly the same cost and nearly optimal speedup. 